Data-driven Inverse Optimization with Imperfect Information
Peyman Mohajerin Esfahani, Soroosh Shafieezadeh-Abadeh, Grani Adiwena Hanasusanto, Daniel Kuhn
Abstract.
In data-driven inverse optimization an observer aims to learn the preferences of an agent who solves a parametric optimization problem depending on an exogenous signal. Thus, the observer seeks the agent's objective function that best explains a historical sequence of signals and corresponding optimal actions. We focus here on situations where the observer has imperfect information, that is, where the agent's true objective function is not contained in the search space of candidate objectives, where the agent suffers from bounded rationality or implementation errors, or where the observed signal-response pairs are corrupted by measurement noise. We formalize this inverse optimization problem as a distributionally robust program minimizing the worst-case risk that the predicted decision (i.e., the decision implied by a particular candidate objective) differs from the agent's actual response to a random signal. We show that our framework offers rigorous out-of-sample guarantees for different loss functions used to measure prediction errors and that the emerging inverse optimization problems can be exactly reformulated as (or safely approximated by) tractable convex programs when a new suboptimality loss function is used. We show through extensive numerical tests that the proposed distributionally robust approach to inverse optimization often attains better out-of-sample performance than the state-of-the-art approaches.

Date: July 25, 2017. The authors are with the Delft Center for Systems and Control, TU Delft, The Netherlands ([email protected]), the Risk Analytics and Optimization Chair, EPFL, Switzerland ({soroosh.shafiee,daniel.kuhn}@epfl.ch), and the Graduate Program in Operations Research and Industrial Engineering, UT Austin, USA ([email protected]).

1. Introduction
In inverse optimization an observer aims to learn the preferences of an agent who solves a parametric optimization problem depending on an exogenous signal. The observer knows the constraints imposed on the agent's actions but is unaware of her objective function. By monitoring a sequence of signals and corresponding actions, the observer seeks to identify an objective function that makes the observed actions optimal in the agent's optimization problem. This learning problem can be cast as an inverse optimization problem over candidate objective functions. The hope is that the solution of this inverse problem enables the observer to predict the agent's future actions in response to new signals.

Inverse optimization has a wide spectrum of applications spanning several disciplines, ranging from econometrics and operations research to engineering and biology. For example, a marketing executive aims to understand the purchasing behavior of consumers with unknown utility functions by monitoring sales figures [1, 11, 7], a transportation planner wishes to learn the route choice preferences of the passengers in a multimodal transport system by measuring traffic flows [3, 15, 17, 20], or a healthcare manager seeks to design clinically acceptable treatments in view of historical treatment plans [19]. It is even believed that the behavior of many biological systems is governed by a principle of optimality with respect to an unknown decision criterion, which can be inferred by tracking the system [13, 38]. Inverse optimization has also been applied in geoscience [32, 41], portfolio selection [10, 26], production planning [39], inventory management [18], network design and control [3, 17, 25, 20] and the analysis of electricity prices [34].

The main thrust of the early literature on inverse optimization is to identify an objective function that explains a single observation. In the seminal paper [4] the agent solves a static (non-parametric) linear program and reveals her optimal decision to the observer, who then identifies the objective function closest
to a prescribed nominal objective, under which the observed decision is optimal. This model was later extended to conic programs [26], integer programs [2, 24, 35, 40] and linearly constrained separable convex programs [42]. Another variant of this problem is considered in [2], where the observer identifies an admissible objective function for which the optimal value of the agent's problem is closest to the observed optimal value corresponding to the unknown true objective.

This paper focuses on data-driven inverse optimization problems where the agent solves a parametric optimization problem several times. Accordingly, the observer has access to a finite sequence of signals and corresponding optimal responses. Using this training data, the observer aims to infer an objective function that accurately predicts the agent's optimal responses to unseen future signals. As in classical regression, this learning task could be addressed by minimizing an empirical loss that penalizes the mismatch between the predicted and true optimal responses to a given signal.
Data-driven inverse optimization problems of this typehave only just started to attract attention, and to the best of our knowledge there are currently only threepapers that study such problems. In [27] the observer seeks an objective function under which all observeddecisions solve the Karush-Kuhn-Tucker (KKT) optimality conditions of the agent’s convex optimizationproblem. To this end, the observer minimizes some norm of the KKT residuals at all observations. Asimilar goal is pursued in [11], where the optimality conditions are expressed via variational inequalitiesthat can be reformulated as tractable conic constraints using ideas from robust optimization. This approachhas the additional benefit that it extends to more general inverse equilibrium problems, which indicatesthat inverse optimization problems constitute special instances of mathematical programs with equilibriumconstraints. A comprehensive survey of variational inequalities and mathematical programs with equilibriumconstraints is provided in [23]. The third paper suggests to minimize the empirical average of the squaredEuclidean distances between the predicted and true observed decisions, in which case the data-driven inverseoptimization problem reduces to a bilevel program [6].In summary, all existing approaches to data-driven inverse optimization solve an empirical loss minimizationproblem over some search space of candidate objectives. Different approaches mainly differ with respect tothe loss functions that capture the mismatch between predictions and observations. The
KKT loss usedin [27] quantifies the extent to which the observed response to some signal violates the KKT conditions for afixed candidate objective. Similarly, the first-order loss used in [11] measures the extent to which an observedresponse violates the first-order optimality conditions. Moreover, the predictability loss used in [6] capturesthe squared distance between an observed response and the response predicted by a given candidate objective.In this paper we introduce the new suboptimality loss , which quantifies the degree of suboptimality of anobserved response under a given candidate objective. We highlight that the predictability and suboptimalitylosses both enjoy a direct physical meaning, while the KKT and first-order losses are not as easily interpretable.Computational experiments in [27] and [11] suggest that empirical loss minimization problems under perfectinformation are likely to correctly identify the agent’s true objective function if there is sufficient trainingdata and the search space of candidate objectives is not too large. In any realistic setting, however, theobserver is confronted with imperfect information such as model uncertainty (the agent’s true objective is notone of the candidate objectives), noisy measurements (the observed signals and responses are corrupted bymeasurement errors) or bounded rationality (the agent settles for suboptimal responses due to cognitive orcomputational limitations). Due to overfitting effects, imperfect information can severely impair the predictivepower of a candidate objective obtained via empirical loss minimization. This is simply a manifestation of thenotorious ‘garbage in-garbage out’ phenomenon. As imperfect information certainly reflects the norm ratherthan the exception in inverse optimization, we propose here a systematic approach to combat overfitting viadistributionally robust optimization. Specifically, inspired by [30] and [36], we use the Wasserstein distance toconstruct a ball in the space of all signal-response-distributions centered at the empirical distribution on thetraining samples, and we formulate a distributionally robust inverse optimization problem that minimizes theworst-case risk of loss for any combination of a risk measure with a loss function, where the worst case is takenover all distributions in the Wasserstein ball. If the radius of the Wasserstein ball is chosen judiciously, we
can guarantee that it contains the unknown data-generating distribution with high confidence, which in turn allows us to derive rigorous out-of-sample guarantees for the risk of loss of unseen future observations. The proposed distributionally robust inverse optimization problem can naturally be interpreted as a regularization of the corresponding empirical loss minimization problem. While regularization is known to improve the out-of-sample performance of numerous estimators in statistics, it has not yet been investigated systematically in the context of data-driven inverse optimization.

We highlight the following main contributions of this paper relative to the existing literature:

• We propose the suboptimality loss as an alternative to the KKT, first-order and predictability losses. The suboptimality loss admits a direct physical interpretation (like the predictability loss) and leads to convex empirical loss minimization problems (like the KKT and first-order losses) whenever the candidate objective functions admit a linear parameterization. In contrast, empirical predictability loss minimization problems constitute NP-hard bilevel programs even for linear candidate objectives. We also propose the bounded rationality loss, which generalizes the suboptimality loss to situations where the agent is known to select δ-suboptimal decisions due to bounded rationality.

• We leverage the data-driven distributionally robust optimization scheme with Wasserstein balls developed in [30] to regularize empirical inverse optimization problems under imperfect information. As such, the proposed approach offers out-of-sample guarantees for any combination of risk measures and loss functions. In contrast, [11] develops out-of-sample guarantees only for the value-at-risk of the first-order loss, while [27] and [6] discuss no (finite) out-of-sample guarantees at all.

• We study the tractability properties of the distributionally robust inverse optimization problem that minimizes the conditional value-at-risk of the suboptimality loss. We prove that this problem is equivalent to a convex program when the search space consists of all linear functions. We also show that this problem admits a safe convex approximation when the search space consists of all convex quadratic functions or all conic combinations of finitely many convex basis functions.

• We argue that the first-order and suboptimality losses can be used as tractable approximations for the intractable predictability loss, which has desirable statistical consistency properties [6] and is the preferred loss function if the observer aims for prediction accuracy. We show that if the candidate objective functions are strongly convex, then the estimators obtained from minimizing the first-order and suboptimality losses admit out-of-sample predictability guarantees. Moreover, the predictability guarantee corresponding to the suboptimality loss is stronger than the one obtained from the first-order loss. Recall that the predictability loss itself cannot be minimized in polynomial time.

• We show through extensive numerical tests that the proposed distributionally robust approach to inverse optimization often attains better (lower) out-of-sample suboptimality and predictability than the state-of-the-art approaches in [11] and [6]. All of our experiments are reproducible, and the underlying source codes are available at https://github.com/sorooshafiee/InverseOptimization.

The rest of the paper develops as follows.
In Sections 2 and 3 we formalize the inverse optimization problem under perfect and imperfect information, respectively. Section 4 then introduces the distributionally robust approach to inverse optimization, while Sections 5 and 6 derive tractable reformulations and safe approximations for distributionally robust inverse optimization problems over search spaces of linear and quadratic candidate objectives, respectively. Numerical results are reported in Section 7.
Notation.
The inner product of two vectors s, t ∈ R^m is denoted by ⟨s, t⟩ := s⊤t, and the dual of a norm ‖·‖ on R^m is defined through ‖t‖_* := sup_{‖s‖≤1} ⟨t, s⟩. The dual of a proper (closed, solid, pointed) convex cone C ⊆ R^m is defined as C* := {t ∈ R^m : ⟨t, s⟩ ≥ 0 ∀ s ∈ C}, and the relation s ⪰_C t is interpreted as s − t ∈ C. Similarly, for two symmetric matrices Q, R ∈ R^{m×m} the relation Q ⪰ R (Q ⪯ R) means that Q − R is positive (negative) semidefinite. The identity matrix is denoted by I. We denote by δ_ξ the Dirac distribution concentrating unit mass at ξ ∈ Ξ. The N-fold product of a distribution P on Ξ is denoted by P^N, which represents a distribution on the Cartesian product Ξ^N.

2. Inverse Optimization under Perfect Information
Consider an agent who first receives a random signal s ∈ S ⊆ R^m and then solves the following parametric optimization problem:

    minimize_{x ∈ X(s)} F(s, x).    (1)

Note that both the objective function F : R^m × R^n → R as well as the (multivalued) feasible set mapping X : R^m ⇉ R^n depend on the signal. We assume that the set of minimizers X⋆(s) := argmin_{x ∈ X(s)} F(s, x) is non-empty for every s ∈ S. Consider also an independent observer who monitors the signal s ∈ S as well as the agent's optimal response x ∈ X⋆(s). We assume that the observer is ignorant of the agent's preferences encoded by the objective function F. Thus, a priori, the observer cannot predict the agent's response x to a particular signal s. Throughout the paper we assume that the observed signal-response pairs ξ := (s, x) are governed by some probability distribution P supported on Ξ := {(s, x) : s ∈ S, x ∈ X(s)}, which can be viewed as the graph of the feasible set mapping X. Note that the marginal distribution of s under P captures the frequency of the exogenous signals, while the conditional distribution of x given s places all probability mass on the argmin set X⋆(s). Note that, unless X⋆(s) is a singleton, the exact conditional distribution of x given s depends on the specific optimization algorithm used by the agent.

In the following we assume that the observer has access to N independent samples ξ̂_i := (ŝ_i, x̂_i) from P, which can be used to learn the agent's objective function. As the space of all possible objective functions is vast, the observer seeks to approximate F by some candidate objective function from within a parametric hypothesis space F = {F_θ : θ ∈ Θ}, where Θ represents a finite-dimensional parameter set. Ideally, the observer would aim to identify the hypothesis F_θ closest to F, e.g., by solving the least squares problem

    minimize_{θ ∈ Θ} (1/N) Σ_{i=1}^N ℓ_θ(ŝ_i, x̂_i),    (2)

where ℓ_θ(ŝ_i, x̂_i) denotes the identifiability loss as per the following definition.

Definition 2.1 (Identifiability loss). The identifiability loss of model θ is given by

    ℓ_θ(s, x) := |F(s, x) − F_θ(s, x)|.    (3)

Unfortunately, the identifiability loss of a training sample cannot be evaluated unless the agent's objective function F is known. Indeed, the observer is blind to the agent's objective values F(ŝ_i, x̂_i) and only sees the signals ŝ_i and responses x̂_i. Thus, the identifiability loss cannot be used to learn F. It can merely be used to assess the quality of a hypothesis F_θ obtained with another method in a synthetic experiment where the true objective F is known. As two objective functions have the same minimizers whenever they are related through a strictly monotonically increasing transformation, however, it is indeed fundamentally impossible to learn F from the available training data. At best we can learn the set of its minimizers X⋆(s) for every s.

If a hypothesis F_θ is used in lieu of F, it can be used to predict the agent's optimal response to a signal s by solving a variant of problem (1), where F is replaced with F_θ. In the following we define X⋆_θ(s) := argmin_{y ∈ X(s)} F_θ(s, y) and refer to any x ∈ X⋆_θ(s) as a response to s predicted by θ. Note that x ∈ X⋆_θ(s) if and only if the response x to s can be explained by model θ.
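The prediction step described above is a single forward solve of the agent's problem under the candidate objective. The following sketch illustrates it for a linear hypothesis F_θ(s, x) = ⟨θ, x⟩ and a polyhedral feasible set X(s) = {x : Wx ≥ Hs + h} (the parametrization formalized later in Assumption 5.1); the helper and the toy data are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (assumptions: linear hypothesis F_theta(s, x) = <theta, x> and a polyhedral,
# signal-dependent feasible set X(s) = {x : W x >= H s + h}; all names are illustrative).
import numpy as np
import cvxpy as cp

def predicted_response(theta, W, H, h, s):
    """Return one minimizer of F_theta(s, .) over X(s), i.e., an element of X*_theta(s)."""
    x = cp.Variable(W.shape[1])
    problem = cp.Problem(cp.Minimize(theta @ x), [W @ x >= H @ s + h])
    problem.solve()
    return x.value

# Toy example: X(s) is the box 0 <= x <= (1 + s) * ones(2).
W = np.vstack([np.eye(2), -np.eye(2)])
H = np.vstack([np.zeros((2, 1)), -np.ones((2, 1))])
h = np.concatenate([np.zeros(2), -np.ones(2)])
print(predicted_response(np.array([1.0, -2.0]), W, H, h, np.array([0.5])))
```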
In order to assess the quality of a candidate model θ, the observer should now check whether X⋆_θ(s) ≈ X⋆(s) with high probability over s ∈ S. This can be achieved by solving an empirical loss minimization problem of the form (2) with a loss function that satisfies ℓ_θ(s, x) = 0 if x ∈ X⋆_θ(s) and ℓ_θ(s, x) > 0 otherwise, that is, a loss function that vanishes exactly when x can be explained as an optimal response to s under model θ. In order to learn X⋆(s), an intuitive approach is to minimize the predictability loss defined below.

Definition 2.2 (Predictability loss). The predictability loss of model θ is given by

    ℓ_θ(s, x) := min_{y ∈ X⋆_θ(s)} ‖x − y‖₂².    (4a)

It quantifies the squared Euclidean distance of x from the set of responses to s predicted by θ.

The predictability loss is known to offer strong statistical consistency guarantees but renders (2) an NP-hard bilevel optimization problem even if the agent's subproblem is convex [6]. Thus, the predictability loss can only be used for low-dimensional problems involving moderate sample sizes. An alternative choice is to minimize the suboptimality loss proposed in this paper, which we will show to be computationally attractive.
Definition 2.3 (Suboptimality loss). The suboptimality loss of model θ is given by

    ℓ_θ(s, x) := F_θ(s, x) − min_{y ∈ X(s)} F_θ(s, y).    (4b)

It quantifies the suboptimality of x with respect to F_θ given the signal s.
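For a fixed model θ the suboptimality loss (4b) is computable with one forward solve of the agent's problem, in contrast to the predictability loss (4a), which requires projecting onto the set of predicted responses. A minimal sketch under the same illustrative assumptions as before (linear hypothesis, polyhedral feasible set):

```python
# A minimal sketch (assumptions: linear hypothesis and X(s) = {x : W x >= H s + h}).
import cvxpy as cp

def suboptimality_loss(theta, W, H, h, s, x):
    """Evaluate (4b): F_theta(s, x) minus the optimal value of the agent's problem under F_theta."""
    y = cp.Variable(W.shape[1])
    optimal_value = cp.Problem(cp.Minimize(theta @ y), [W @ y >= H @ s + h]).solve()
    return float(theta @ x) - optimal_value
```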
Another computationally attractive loss function is the degree of violation of the agent's first-order optimality condition [11].

Definition 2.4 (First-order loss). If F_θ is differentiable with respect to x, the first-order loss is given by

    ℓ_θ(s, x) := max_{y ∈ X(s)} ⟨∇_x F_θ(s, x), x − y⟩.    (4c)

It quantifies the extent to which x violates the first-order optimality condition of the optimization problem (1) for a given s, where F is replaced with F_θ. Note that the first-order loss vanishes whenever x represents a local minimizer of F_θ(s, ·) over X(s).

Note that the predictability loss best captures the observer's objective to predict the agent's decisions. However, the suboptimality loss and the first-order loss have better computational properties. Indeed, we will argue below that the learning model (2) with the loss functions (4b) or (4c) is computationally tractable under suitable convexity assumptions about the agent's decision problem (1), the support set Ξ and the hypothesis space F. Thus, we encounter a similar situation as in binary classification, where it is preferable to minimize the convex hinge loss instead of the discontinuous 0-1 loss, which is the actual quantity of interest.

The following proposition establishes basic properties of the loss functions (4).

Proposition 2.5 (Dominance relations between loss functions). Assume that F_θ(s, x) is convex and differentiable in x, and define γ ≥ 0 as the largest number satisfying the inequality

    F_θ(s, y) − F_θ(s, x) ≥ ⟨∇_x F_θ(s, x), y − x⟩ + γ‖y − x‖²   ∀ x, y ∈ X(s), ∀ s ∈ S.    (5)

If ℓ^p_θ, ℓ^s_θ and ℓ^f_θ denote the predictability, suboptimality and first-order losses, respectively, then we have

    ℓ^f_θ(s, x) ≥ ℓ^s_θ(s, x) ≥ γ ℓ^p_θ(s, x)   ∀ s ∈ S, x ∈ X(s).    (6)

Moreover, all three loss functions are non-negative and evaluate to zero if and only if x ∈ X⋆_θ(s).

Note that (5) always holds for γ = 0 due to the first-order condition of convexity [14, Section 3.1.3].

Proof of Proposition 2.5.
Setting γ = 0 and minimizing both sides of (5) over y ∈ X ( s ) yields (cid:96) f θ ( s, x ) ≥ (cid:96) s θ ( s, x ). Next, the first-order optimality condition of the convex program (1) with objective function F θ requires that (cid:10) ∇ x F θ ( s, x ) , y − x (cid:11) ≥ ∀ y ∈ X ( s )(7)at any optimal point x ∈ X (cid:63)θ ( s ). Combining the inequalities (5) and (7) then yields F θ ( s, y ) − F θ ( s, x ) ≥ γ (cid:107) y − x (cid:107) ∀ x ∈ X (cid:63)θ ( s ) , y ∈ X ( s ) . Minimizing both sides of the above inequality over x ∈ X (cid:63)θ ( s ) yields (cid:96) s θ ( s, x ) ≥ γ (cid:96) p θ ( s, x ). Note that thisinequality is only useful for γ >
0, in which case X⋆_θ(s) is in fact a singleton. It is straightforward to verify that all loss functions (4) are non-negative and evaluate to zero if and only if x ∈ X⋆_θ(s). In the case of the first-order loss, for instance, this equivalence holds because the first-order condition (7) is both necessary and sufficient for the optimality of x. We remark that (6) remains valid if γ depends on s and θ. □

While the basic estimation models in statistical learning all minimize an empirical loss as in (2), some inverse optimization models proposed in [11] implicitly minimize the worst-case loss across all training samples. To capture both approaches in a unified model, we suggest here to minimize a normalized, positive homogeneous and monotone risk measure ρ that penalizes positive losses. More precisely, we denote by ρ_Q(ℓ_θ) the risk of the loss ℓ_θ(ξ) if ξ = (s, x) follows the distribution Q. The inverse optimization problem (2) thus generalizes to

    minimize_{θ ∈ Θ} ρ_{P̂_N}(ℓ_θ),   where   P̂_N := (1/N) Σ_{i=1}^N δ_{ξ̂_i}    (8)

represents the empirical distribution on the training samples. In the remainder we refer to ρ_{P̂_N}(ℓ_θ) as the empirical or in-sample risk. Note that (8) reduces indeed to (2) if we choose the expected value as the risk measure. Two alternative risk measures that could be used in (8) are described in the following example.

Example. A popular risk measure that the observer could use to quantify the risk of a positive loss is the conditional value-at-risk (CVaR) at level α ∈ (0, 1],

    CVaR^Q_α(ℓ_θ) = inf_τ { τ + (1/α) E_Q[ max{ℓ_θ(s, x) − τ, 0} ] },    (9a)

see [33]. For α = 1, the CVaR reduces to the expected value, and for α ↓ 0, it converges to the essential supremum of the loss. Alternatively, the observer could use the value-at-risk (VaR) at level α ∈ [0, 1) defined as

    VaR^Q_α(ℓ_θ) = inf_τ { τ : Q[ℓ_θ(s, x) ≤ τ] ≥ 1 − α }.    (9b)

Note that the VaR coincides with the upper (1 − α)-quantile of the loss distribution. Moreover, if ℓ_θ(s, x) has a continuous marginal distribution under Q, then CVaR^Q_α(ℓ_θ) coincides with the expected loss above VaR^Q_α(ℓ_θ).

By definition, the loss function ℓ_θ(ξ) is non-negative for all ξ ∈ Ξ and θ ∈ Θ. The monotonicity and normalization of the risk measure ρ thus imply that

    ρ_{P̂_N}(ℓ_θ) ≥ ρ_{P̂_N}(0) = 0   ∀ θ ∈ Θ,

which in turn implies that the optimal value of problem (8) is necessarily larger than or equal to zero. Moreover, if the agent's true objective function is contained in F, that is, if F = F_{θ⋆} for some θ⋆ ∈ Θ, then the loss ℓ_{θ⋆}(ξ̂_i) vanishes for all i, indicating that the optimal value of (8) is zero and that θ⋆ is optimal in (8). In this case, the optimal value of the in-sample risk minimization problem (8) is known a priori, and the only informative output of any solution scheme is an optimizer, that is, a model θ′ ∈ Θ with zero in-sample risk. If the number of training samples is moderate, then there may be multiple optimal solutions, and θ′ may differ from the agent's true model θ⋆.

Remark 2.6 (Choice of risk measures). If F = F_{θ⋆} for θ⋆ ∈ Θ, then the minimum of (8) vanishes and is attained at θ⋆ irrespective of ρ. Thus, one might believe that the choice of the risk measure is immaterial for the inverse optimization problem. However, different risk measures may result in different solution sets. For example, if ρ is the CVaR at level α ∈ (0, 1], then θ′ is a minimizer of (8) if and only if ℓ_{θ′}(ŝ_i, x̂_i) = 0 for all i ≤ N. In contrast, if ρ is the VaR at level α ∈ [0, 1), then θ′ is a minimizer of (8) if and only if ℓ_{θ′}(ŝ_i, x̂_i) = 0 for a portion of at least 1 − α of all N training samples. Thus, the use of VaR may lead to an inflated solution set. Moreover, the choice of ρ impacts the tractability of (8); see Sections 5 and 6.
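Under the empirical distribution (8) the CVaR in (9a) is a one-dimensional convex minimization over τ whose optimum is attained at one of the sample losses, so it can be evaluated by a simple scan. A minimal numpy sketch, assuming the loss values ℓ_θ(ŝ_i, x̂_i) have already been computed:

```python
# A minimal sketch of evaluating (9a) under the empirical distribution on N training samples.
import numpy as np

def empirical_cvar(losses, alpha):
    """Evaluate inf_tau { tau + mean(max(losses - tau, 0)) / alpha }."""
    losses = np.asarray(losses, dtype=float)
    # The objective is convex and piecewise linear in tau with breakpoints at the sample losses,
    # so the infimum is attained at one of these breakpoints (the empirical VaR).
    return min(tau + np.mean(np.maximum(losses - tau, 0.0)) / alpha for tau in losses)

print(empirical_cvar([0.0, 0.1, 0.0, 0.4, 1.2], alpha=0.2))   # equals the largest loss when alpha = 1/N
```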
3. Inverse Optimization under Imperfect Information

The proposed framework for inverse optimization described in Section 2 is predicated on the assumption of perfect information. Specifically, it is assumed that the agent is able to determine and implement the best response x to any given signal s and that the observer can measure s and x precisely. Moreover, it is implicitly assumed that the family F of candidate objective functions contains the agent's true objective function F. In practice, however, the observer may be confronted with the following challenges:

(i) Model uncertainty: The hypothesis space F chosen by the observer may not be rich enough to contain the agent's true objective function F or any strictly increasing transformation of F that encodes the same preferences.

(ii) Measurement noise: The observed signal-response pairs may be corrupted by noise, which prevents the observer from measuring them exactly.

(iii) Bounded rationality:
The agent may settle for a suboptimal decision x due to cognitive or computational limitations. Even if the best response can be computed exactly, an exact implementation of the desired best response may not be possible due to implementation errors [9].

In the presence of model uncertainty, that is, if neither F nor any strictly increasing transformation of F is contained in the chosen hypothesis space F, there typically exists no θ ∈ Θ such that the loss functions described in Section 2 vanish on all training samples. In this case a perfect recovery of the agent's preferences is fundamentally impossible, and the optimal value of the empirical risk minimization problem (8) is positive. The best the observer can hope for is to learn the parametric hypothesis that most accurately (but imperfectly) captures the agent's true preferences.

In the presence of measurement noise the observed training samples ξ̂_i = (ŝ_i, x̂_i) represent random perturbations of some unobservable pure samples ξ̃_i = (s̃_i, x̃_i). While x̃_i is an exact optimal response to an unperturbed signal s̃_i, the noisy response x̂_i generically constitutes a suboptimal (maybe even infeasible) response to ŝ_i. We will henceforth assume that the noisy samples ξ̂_i are mutually independent and follow an in-sample distribution P_in, while the corresponding unperturbed samples ξ̃_i are governed by an out-of-sample distribution P_out. While the distribution P_out of the perfect samples is supported on Ξ by construction, the in-sample distribution P_in of the noisy samples may or may not be supported on Ξ. If the noisy samples can materialize outside of Ξ, then we call the noise inconsistent. If all noisy samples are guaranteed to reside within Ξ, on the other hand, then we call the noise consistent. Thus, in the presence of measurement noise, the observer faces the challenging task to learn P_out from samples of P_in. Note that in the absence of noise we have P_in = P_out, which coincides with the distribution P introduced in Section 2.

Agents suffering from bounded rationality may not be able to solve (1) to global optimality. Such agents may respond to a given signal s with a δ-suboptimal solution x_δ, that is, a decision x_δ ∈ X(s) with F(s, x_δ) ≤ min_{x ∈ X(s)} F(s, x) + δ. Here, the parameter δ ≥ 0 quantifies the agent's degree of irrationality. For δ > 0, the agent accepts a cost increase of up to δ for the freedom of choosing a δ-suboptimal decision, which may require fewer cognitive or computational resources than finding a global minimizer. Note that the observer may mistakenly perceive the effects of bounded rationality as (consistent) measurement noise or vice versa. So one might argue that there is no need to distinguish between these two types of imperfect information. However, bounded rationality is fundamentally different from measurement noise if the observer knows that the imperfect measurements are caused by bounded rationality. If the imperfections originate from measurement noise, then the observer aims to filter out the noise in order to predict the agent's pure decisions. If the imperfections originate from bounded rationality, on the other hand, then the observer aims to predict the agent's δ-suboptimal decisions and not the ideal global optimizers. In other words, the observer always aims to predict the agent's responses bar measurement noise, irrespective of whether these responses are rational or not. An observer who knows the agent's degree of irrationality δ may thus improve the predictive power of her learning model by replacing the suboptimality loss (4b) with the bounded rationality loss.

Definition 3.1 (Bounded rationality loss). The bounded rationality loss of model θ is given by

    ℓ_θ(δ; s, x) := max{ F_θ(s, x) − min_{y ∈ X(s)} F_θ(s, y) − δ, 0 }.    (10)
It quantifies the amount by which the suboptimality of x with respect to F_θ given signal s exceeds δ ≥ 0. By construction, ℓ_θ(δ; s, x) = 0 whenever x is a δ-suboptimal response to the signal s, and ℓ_θ(δ; s, x) > 0 otherwise.

Note that the bounded rationality loss with δ = 0 reduces to the usual suboptimality loss (4b). Even if it is known that the agent suffers from bounded rationality, the constant δ is unlikely to be available in practice. However, as long as there are no other sources of imperfect information (such as model uncertainty or measurement noise), the observer can jointly estimate δ and θ from the training data by solving

    minimize_{θ ∈ Θ, δ ≥ 0} { δ : ℓ_θ(δ; ŝ_i, x̂_i) = 0 ∀ i ≤ N }.    (11)

This variant of the inverse optimization problem identifies the smallest bounded rationality constant δ that explains all observed responses as δ-suboptimal decisions under model θ. Note that (11) constitutes a convex optimization problem as long as Θ is convex and the hypotheses F_θ(s, x) are affinely parameterized in θ, which guarantees that ℓ_θ(δ; s, x) is jointly convex in θ and δ for any fixed s and x. Moreover, instead of solving (11), one could equivalently solve the empirical risk minimization problem (8) with the suboptimality loss and the essential supremum risk measure to find θ and then set δ to the resulting optimal objective value.

To some extent, all of the complications discussed in this section are inevitable in any real application. In fact, they are likely to reflect the norm rather than the exception. Thus, the main objective of this paper is to develop inverse optimization models that can cope with imperfect information in a principled manner.
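To make the estimation procedure around (11) concrete: for a fixed model θ, the smallest admissible δ is simply the largest suboptimality loss over the training samples, so (11) amounts to minimizing this quantity over Θ. A minimal sketch for a fixed θ, under the same illustrative assumptions as in Section 2 (linear hypothesis, polyhedral feasible sets):

```python
# A minimal sketch (fixed theta; linear hypothesis; X(s) = {x : W x >= H s + h}).
import cvxpy as cp

def smallest_rationality_gap(theta, W, H, h, signals, responses):
    """Return the smallest delta for which every observed response is delta-suboptimal under theta."""
    gaps = []
    for s, x in zip(signals, responses):
        y = cp.Variable(W.shape[1])
        optimal_value = cp.Problem(cp.Minimize(theta @ y), [W @ y >= H @ s + h]).solve()
        gaps.append(float(theta @ x) - optimal_value)   # suboptimality loss of sample (s, x)
    return max(gaps)
```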
4. Distributionally Robust Inverse Optimization

By combining different loss functions with different risk measures, one may synthesize different empirical risk minimization problems of the form (8). By construction, any solution of (8) has minimum in-sample risk. However, the in-sample risk merely captures historical performance and is therefore of little practical interest. Instead, the observer seeks models that display promising performance on unseen future data.
Definition 4.1 (Out-of-sample risk). We refer to ρ_{P_out}(ℓ_θ) as the out-of-sample risk of the model θ ∈ Θ.

Ideally, the observer would want to minimize the out-of-sample risk over all candidate models θ ∈ Θ. This is impossible, however, because the out-of-sample distribution P_out of the signal-response pairs is unknown, and only a finite set of training samples from P_in is available (recall that P_in = P_out if measurements are perfect). In this situation, the observer has to settle for a data-driven solution θ̂_N ∈ Θ that depends on the training samples and hopefully attains a low out-of-sample risk. We emphasize that, due to its dependence on the training samples, θ̂_N constitutes a random object, whose stochastics is governed by the product distribution P^N_in. A simple data-driven solution is obtained, for instance, by solving the empirical risk minimization problem (8).

While it is impossible to minimize the out-of-sample risk on the basis of the training samples, it is sometimes possible to establish data-driven out-of-sample guarantees in the sense of the following definition.

Definition 4.2 (Out-of-sample guarantee). We say that a data-driven solution θ̂_N enjoys an out-of-sample performance guarantee at significance level β ∈ [0, 1] if there exists a data-driven certificate Ĵ_N with

    P^N_in[ ρ_{P_out}(ℓ_{θ̂_N}) ≤ Ĵ_N ] ≥ 1 − β.    (12)

Note that the probability in (12) is evaluated with respect to the distribution of N independent (potentially noisy) training samples, which impact both the data-driven solution θ̂_N and the certificate Ĵ_N. Note also that the certificate Ĵ_N can be viewed as an upper (1 − β)-confidence bound on the out-of-sample risk of θ̂_N. Thus, we sometimes refer to the confidence level 1 − β as the certificate's reliability.

As the ideal goal to minimize the out-of-sample risk is unachievable, the observer might settle for the more modest goal to determine a data-driven solution that admits a low certificate with a high reliability. We will now argue that this secondary goal is achievable by adopting a distributionally robust approach. Specifically, we will use the N training samples to design an ambiguity set 𝒫̂_N that contains the out-of-sample distribution P_out of the (perfect) signal-response pairs with confidence 1 − β. Next, we construct the data-driven solution θ̂_N and the corresponding certificate Ĵ_N by minimizing the worst-case risk across all models θ ∈ Θ, where the worst case is taken with respect to all signal-response distributions in the ambiguity set 𝒫̂_N, that is, we set

    θ̂_N ∈ argmin_{θ ∈ Θ} sup_{Q ∈ 𝒫̂_N} ρ_Q(ℓ_θ)   and   Ĵ_N := min_{θ ∈ Θ} sup_{Q ∈ 𝒫̂_N} ρ_Q(ℓ_θ).    (13)

It is clear that if P^N_in[P_out ∈ 𝒫̂_N] ≥ 1 − β, then the distributionally robust solution θ̂_N and the corresponding certificate Ĵ_N defined above satisfy the out-of-sample guarantee (12). In order to ensure that 𝒫̂_N contains the unknown out-of-sample distribution P_out with confidence 1 − β, we construct the ambiguity set 𝒫̂_N as a ball in the space of probability distributions with respect to the Wasserstein metric as suggested in [30].

Definition 4.3 (Wasserstein metric).
For any integer p ≥ 1 and closed set Ξ ⊂ R^{n+m} we let M_p(Ξ) be the space of all probability distributions Q supported on Ξ with E_Q[‖ξ‖^p] = ∫_Ξ ‖ξ‖^p Q(dξ) < ∞. The p-Wasserstein distance between two distributions Q₁, Q₂ ∈ M_p(R^{n+m}) is defined as

    W_p(Q₁, Q₂) := inf { ( ∫ ‖ξ₁ − ξ₂‖^p Π(dξ₁, dξ₂) )^{1/p} : Π is a joint distribution of ξ₁ and ξ₂ with marginals Q₁ and Q₂, respectively }.

The Wasserstein distance W_p(Q₁, Q₂) can be viewed as the (p-th root of the) minimum cost for moving the distribution Q₁ to Q₂, where the cost of moving a unit mass from ξ₁ to ξ₂ amounts to ‖ξ₁ − ξ₂‖^p. The joint distribution Π of ξ₁ and ξ₂ is therefore naturally interpreted as a mass transportation plan.

We define the ambiguity set as a p-Wasserstein ball in M_p(Ξ) centered at the empirical distribution P̂_N defined in (8). Specifically, we define the p-Wasserstein ball of radius ε around P̂_N as

    B^p_ε(P̂_N) := { Q ∈ M_p(Ξ) : W_p(Q, P̂_N) ≤ ε }.

Note that if ε = 0 and the empirical distribution is supported on Ξ, which is necessarily true in the absence of measurement noise, then the Wasserstein ball B^p_ε(P̂_N) shrinks to the singleton set that contains only the empirical distribution. In this case, the distributionally robust inverse optimization problem (13) reduces to the empirical risk minimization problem (8).
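For discrete distributions the infimum over transportation plans in Definition 4.3 is a finite linear program, which makes the metric easy to evaluate numerically. The following sketch (illustrative helper; 1-Wasserstein distance with the Euclidean ground norm) computes W₁ between two discrete distributions, for instance the empirical distribution P̂_N and a perturbed copy of it:

```python
# A minimal sketch (assumptions: both distributions are discrete; p = 1 and the Euclidean norm
# is used as the ground metric; names are illustrative).
import numpy as np
import cvxpy as cp

def wasserstein_1(atoms_p, weights_p, atoms_q, weights_q):
    """Solve the transportation LP that defines W_1 between two discrete distributions."""
    cost = np.linalg.norm(atoms_p[:, None, :] - atoms_q[None, :, :], axis=2)  # pairwise ||xi_1 - xi_2||
    plan = cp.Variable((len(weights_p), len(weights_q)), nonneg=True)          # transportation plan Pi
    constraints = [cp.sum(plan, axis=1) == weights_p,                          # marginal of xi_1
                   cp.sum(plan, axis=0) == weights_q]                          # marginal of xi_2
    return cp.Problem(cp.Minimize(cp.sum(cp.multiply(cost, plan))), constraints).solve()

rng = np.random.default_rng(0)
samples = rng.normal(size=(10, 3))                     # ten signal-response pairs
weights = np.full(10, 1 / 10)
print(wasserstein_1(samples, weights, samples + 0.1 * rng.normal(size=(10, 3)), weights))
```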
In order to establish out-of-sample guarantees, we must assume that the p-Wasserstein distance between P_in and P_out is bounded by a known constant ε̄ ≥ 0. This is the case, for instance, if the noise is additive and all noise realizations are bounded by ε̄ with certainty.

Proposition 4.4 (Out-of-sample guarantee). Assume that there exists a > 1 with A := E_{P_in}[exp(‖ξ‖^a)] < ∞. Assume also that β ∈ (0, 1) is a prescribed significance level, W_p(P_in, P_out) ≤ ε̄, and m + n ≠ 2p. Then, there exist constants c₁, c₂ > 0 that depend only on a, A, m and n such that P^N_in[P_out ∈ B^p_ε(P̂_N)] ≥ 1 − β whenever ε ≥ ε̄ + ε_N(β), where

    ε_N(β) := ( log(c₁ β⁻¹) / (c₂ N) )^{min{p/(m+n), 1/2}}   if N ≥ log(c₁ β⁻¹)/c₂,
              ( log(c₁ β⁻¹) / (c₂ N) )^{p/a}                   if N < log(c₁ β⁻¹)/c₂.    (14)

Proof.
Select any ε ≥ ε + ε N ( β ). Then, the triangle inequality impliesW p (cid:0) P out , (cid:98) P N (cid:1) ≤ W p (cid:0) P out , P in (cid:1) + W p (cid:0) P in , (cid:98) P N (cid:1) . By assumption, the first term on the right hand side is bounded by ε with certainty. Theorem 3.5 in [30],which leverages a powerful measure concentration result developed in [21, Theorem 2], further guaranteesthat the second term is bounded above by ε N ( β ) with confidence 1 − β . As P out is supported on Ξ, we maythus conclude that P out ∈ B pε ( (cid:98) P N ) with probability 1 − β . This observation completes the proof. (cid:3) Proposition 4.4 readily extends to the case n + m = 2 p at the expense of additional notation by leveraging [21, Theorem 2]. Proposition 4.4 ensures that the distributionally robust solution (cid:98) θ N and the corresponding certificate (cid:98) J N induced by a Wasserstein ambiguity set of radius ε ≥ ε + ε N ( β ) satisfy the out-of-sample guarantee (12).One can further show that if there is no measurement noise ( P in = P out = P ) while β N ∈ (0 ,
1) for N ∈ N satisfies (cid:80) ∞ N =1 β N < ∞ and lim N →∞ ε N ( β N ) = 0, then any accumulation point of { (cid:98) θ N } N ∈ N is P ∞ -almostsurely a minimizer of the the out-of-sample risk ρ P ( (cid:96) θ ) over θ ∈ Θ; see [30, Theorem 3.6].Besides offering rigorous out-of-sample and asymptotic guarantees, the proposed approach to distribution-ally robust inverse optimization can be shown to be tractable if the search space contains only linear orquadratic hypotheses, and risk is measured by the CVaR of the suboptimality loss (4b) or the first-orderloss (4c). Unfortunately, tractability is lost when minimizing the predictability loss (4a), which is the actualquantitiy of interest. However, the computable distributionally robust solutions ( (cid:98) θ s N , (cid:98) J s N ) and ( (cid:98) θ f N , (cid:98) J f N ) cor-responding to the suboptimality and first-order losses, respectively, can be used to construct out-of-sampleguarantees for the predictability loss if the hypotheses are uniformly strongly convex with parameter γ > (cid:96) p θ , (cid:96) s θ and (cid:96) f θ denote the predictability, suboptimality and first-order losses, respectively, we have P N in (cid:104) ρ P out ( (cid:96) s (cid:98) θ s N ) ≤ (cid:98) J s N (cid:105) ≥ − β = ⇒ P N in (cid:20) ρ P out ( (cid:96) p (cid:98) θ s N ) ≤ γ (cid:98) J s N (cid:21) ≥ − β (15a)and P N in (cid:104) ρ P out ( (cid:96) f (cid:98) θ f N ) ≤ (cid:98) J f N (cid:105) ≥ − β = ⇒ P N in (cid:20) ρ P out ( (cid:96) p (cid:98) θ f N ) ≤ γ (cid:98) J f N (cid:21) ≥ − β, (15b)where both implications follow from the dominance relation (6) established in Proposition 2.5 and the scaleinvariance and monotonicity of the CVaR. Thus, both (cid:98) θ s N and (cid:98) θ f N are efficiently computable and offer an out-of-sample guarantee for the predictability loss. While both guarantees involve the same confidence level 1 − β ,however, the underlying certificates J s N and J f N are generically different. One can again use the dominancerelation (6) from Proposition 2.5 and the monotonicity of the CVaR to show that J s N ≤ J f N . Thus, minimizingthe suboptimality loss results in a weakly stronger predictability guarantee. This reasoning suggests that theobserver should favor the suboptimality loss (4b) over the first-order loss (4c). We emphasize that unlike thesuboptimality loss, however, the first-order loss leads to tractable inverse optimization models even in thepresence of several strategically interacting agents [11]. Remark 4.5 (Out-of-sample guarantees in [11]) . The out-of-sample guarantees provided in [11] only applyto the
VaR, and an extension to other risk measures is not envisaged. Specifically, [11, Theorem 6] provides an out-of-sample guarantee for the VaR of the first-order loss, while [11, Theorem 6] offers an out-of-sample guarantee for the VaR of the predictability loss. In contrast, the distributionally robust approach discussed here offers out-of-sample guarantees for any normalized, positive homogeneous and monotone risk measure, including the VaR or any coherent risk measure such as the CVaR.

5. Linear Hypotheses
On the one hand, the hypothesis space F should be rich enough to contain the agent's unknown true objective function F. On the other hand, F should be small enough to ensure tractability of the distributionally robust inverse optimization problem (13) and to prevent degeneracy of its optimal solutions. A particular class F that strikes this delicate balance and proves useful in many applications is the family of linear objective functions F_θ(s, x) := ⟨θ, x⟩. The corresponding search space Θ ⊆ R^n may account for prior information on the agent's objective and should not contain θ = 0, which corresponds to a trivial constant objective function that renders every response optimal. Examples of tractable search spaces are listed below.

Example. If there is no prior information on F, it is natural to set

    Θ := { θ ∈ R^n : ‖θ‖_∞ = 1 }.    (16a)

A possible choice is β_N = exp(−√N).
Note that the normalization (cid:107) θ (cid:107) ∞ = 1 is non-restrictive because the objective functions corresponding to θ and κ θ imply the same preferences for any model θ (cid:54) = 0 and scaling factor κ >
0. In fact, we could define Θas the unit sphere induced by any norm on R n . However, the ∞ -norm stands out from a computationalperspective. While all norm spheres are non-convex and therefore a priori unattractive as search spaces, the ∞ -norm sphere decomposes into 2 n polytopes—one for each facet. This polyhedral decomposition propertyallows us to optimize efficiently over Θ.If F is known to be non-decreasing in the agent’s decisions, a natural choice isΘ := { θ ∈ R n : (cid:107) θ (cid:107) = 1 , θ ≥ } . (16b)This search space has been used in [27] and constitutes a single convex polytope.If F is believed to reside in the vicinity of a nominal objective function (cid:10) θ , x (cid:11) as in [4], then we may setΘ := (cid:8) θ ∈ R n : (cid:107) θ − θ (cid:107) ≤ Γ (cid:9) , (16c)where (cid:107) · (cid:107) denotes a generic norm, and Γ reflects the degree of uncertainty about the nominal model θ .When focusing on linear hypotheses, the suboptimality loss function (4b) reduces to F θ ( s, x ) − min y ∈ X ( s ) F θ ( s, y ) = (cid:10) θ, x (cid:11) − min y ∈ X ( s ) (cid:10) θ, y (cid:11) = max y ∈ X ( s ) (cid:10) θ, x − y (cid:11) = max y ∈ X ( s ) (cid:10) ∇ x F θ ( s, x ) , x − y (cid:11) (17)and thus equals the first-order loss (4c), which is positive homogeneous and subadditive in θ . The tractabilityresults to be established below rely on the following assumption. Assumption 5.1 (Conic representable support) . The signal space S and the feasible set X ( s ) are conicrepresentable, that is, S = (cid:8) s ∈ R m : Cs (cid:23) C d (cid:9) and X ( s ) = { x ∈ R n : W x (cid:23) K Hs + h } ∀ s ∈ S , where the relations ‘ (cid:23) C ’ and ‘ (cid:23) K ’ represent conic inequalities with respect to some proper convex cones C and K of appropriate dimensions, respectively. The set Ξ of all possible signal-response pairs thus reduces to Ξ = (cid:8) ( s, x ) ∈ R m × R n : Cs (cid:23) C d, W x (cid:23) K Hs + h (cid:9) . We also assume that the convex set Ξ admits a Slater point. Under Assumption 5.1, the suboptimality loss (cid:96) θ ( s, x ) is concave in ( s, x ) for every fixed θ , see, e.g. , [14,Section 3.2.5]. We are now ready to state our first tractability result for the class of linear hypotheses. Theorem 5.2 (Linear hypotheses and suboptimality loss) . Assume that F represents the class of linearhypotheses with a search space of the form (16) and that Assumption 5.1 holds. If the observer uses thesuboptimality loss (4b) and measures risk using the CVaR at level α ∈ (0 , , then the distributionally robustinverse optimization problem (13) over the -Wasserstein ball is equivalent to the finite conic program minimize τ + 1 α (cid:32) ελ + 1 N N (cid:88) i =1 r i (cid:33) subject to θ ∈ Θ , λ ∈ R + , τ, r i ∈ R , φ i , φ i ∈ C ∗ , µ i , µ i , γ i ∈ K ∗ ∀ i ≤ N (cid:10) C (cid:98) s i − d, φ i (cid:11) + (cid:10) W (cid:98) x i − H (cid:98) s i − h, µ i + γ i (cid:11) ≤ r i + τ ∀ i ≤ N (cid:10) C (cid:98) s i − d, φ i (cid:11) + (cid:10) W (cid:98) x i − H (cid:98) s i − h, µ i (cid:11) ≤ r i , θ = W (cid:124) γ i ∀ i ≤ N (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) C (cid:124) φ i − H (cid:124) ( µ i + γ i ) W (cid:124) ( µ i + γ i ) (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗ ≤ λ, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) C (cid:124) φ i − H (cid:124) µ i W (cid:124) µ i (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗ ≤ λ ∀ i ≤ N. 
(18)We emphasize that Theorem 5.2 remains valid if the training samples are inconsistent with the givensupport information, that is, if ( (cid:98) s i , (cid:98) x i ) / ∈ Ξ for some i ≤ N , in which case (cid:96) θ ( (cid:98) s i , (cid:98) x i ) can even be negative. Strictly speaking, if Θ is an ∞ -norm ball of the form (16a), then (18) can be viewed as a family of 2 n finite conic programsbecause Θ is non-convex but decomposes into 2 n convex polytopes. Proof of Theorem 5.2.
By the definition of CVaR, the objective function of (13) can be expressed assup Q ∈ B ε ( (cid:98) P N ) ρ Q ( (cid:96) θ ) = sup Q ∈ B ε ( (cid:98) P N ) inf τ τ + 1 α E Q (cid:2) max { (cid:96) θ ( s, x ) − τ, } (cid:3) (19) = inf τ τ + 1 α sup Q ∈ B ε ( (cid:98) P N ) E Q (cid:2) max { (cid:96) θ ( s, x ) − τ, } (cid:3) . The interchange of the maximization over Q and the minimization over τ in the second line is justified bySion’s minimax theorem [37], which applies because the Wasserstein ball B ε ( (cid:98) P N ) is weakly compact [12, p.2298]. The subordinate worst-case expectation problem in the second line of (19) constitutes a semi-infinitelinear program. As the corresponding integrand is given by the maximum of (cid:96) θ ( s, x ) − τ and 0, both of whichcan be viewed as proper concave functions in ( s, x ), this worst-case expectation problem admits a strong dualsemi-infinite linear program of the forminf λ ≥ ,r i ελ + N N (cid:80) i =1 r i s.t. sup ( s,x ) ∈ Ξ sup y ∈ X ( s ) (cid:10) θ, x − y (cid:11) − τ − λ (cid:107) ( s, x ) − ( (cid:98) s i , (cid:98) x i ) (cid:107) ≤ r i ∀ i ≤ N sup ( s,x ) ∈ Ξ − λ (cid:107) ( s, x ) − ( (cid:98) s i , (cid:98) x i ) (cid:107) ≤ r i ∀ i ≤ N, (20)see Theorem 4.2 in [30] for a detailed derivation of (20) for more general integrands. By the definitions of X ( s ) and Ξ put forth in Assumption 5.1, respectively, the i -th member of the first constraint group in (20)holds if and only if the optimal value of the conic programsup s,x,y (cid:10) θ, x − y (cid:11) − τ − λ (cid:107) ( s, x ) − ( (cid:98) s i , (cid:98) x i ) (cid:107) s.t. Cs (cid:23) C d, W x (cid:23) K Hs + h, W y (cid:23) K Hs + h is smaller or equal to r i . The dual of this conic program is given byinf (cid:10) C (cid:98) s i − d, φ i (cid:11) + (cid:10) W (cid:98) x i − H (cid:98) s i − h, µ i + γ i (cid:11) − τ s.t. φ i ∈ C ∗ , µ i , γ i ∈ K ∗ (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) C (cid:124) φ i − H (cid:124) ( µ i + γ i ) W (cid:124) ( µ i + γ i ) (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗ ≤ λ, θ = W (cid:124) γ i , and strong duality holds because the uncertainty set Ξ contains a Slater point due to Assumption 5.1. Thus,the i -th member of the first constraint group in (20) holds if and only if there exist φ i ∈ C ∗ and µ i , γ i ∈ K ∗ such that θ = W (cid:124) γ i , (cid:10) C (cid:98) s i − d, φ i (cid:11) + (cid:10) W (cid:98) x i − H (cid:98) s i − h, µ i + γ i (cid:11) ≤ r i + τ and (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) C (cid:124) φ i − H (cid:124) ( µ i + γ i ) W (cid:124) ( µ i + γ i ) (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗ ≤ λ. A similar reasoning shows that the i -th member of the second constraint group in (20) holds if and only ifthere exist φ i ∈ C ∗ and µ i ∈ K ∗ such that (cid:10) C (cid:98) s i − d, φ i (cid:11) + (cid:10) W (cid:98) x i − H (cid:98) s i − h, µ i (cid:11) ≤ r i and (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) C (cid:124) φ i − H (cid:124) µ i W (cid:124) µ i (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗ ≤ λ. ATA-DRIVEN INVERSE OPTIMIZATION WITH IMPERFECT INFORMATION 13
In summary, the worst-case expectation in the second line of (19) thus coincides with the optimal value ofthe finite conic programinf ελ + N N (cid:80) i =1 r i s.t. λ ∈ R + , r i ∈ R , φ i , φ i ∈ C ∗ , µ i , µ i , γ i ∈ K ∗ ∀ i ≤ N (cid:10) C (cid:98) s i − d, φ i (cid:11) + (cid:10) W (cid:98) x i − H (cid:98) s i − h, µ i + γ i (cid:11) ≤ r i + τ ∀ i ≤ N (cid:10) C (cid:98) s i − d, φ i (cid:11) + (cid:10) W (cid:98) x i − H (cid:98) s i − h, µ i (cid:11) ≤ r i , θ = W (cid:124) γ i ∀ i ≤ N (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) C (cid:124) φ i − H (cid:124) ( µ i + γ i ) W (cid:124) ( µ i + γ i ) (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗ ≤ λ, (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) C (cid:124) φ i − H (cid:124) µ i W (cid:124) µ i (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗ ≤ λ ∀ i ≤ N .
The claim then follows by substituting this conic program into (19). (cid:3)
If ( (cid:98) s i , (cid:98) x i ) ∈ Ξ for all i ≤ N , then the conic program (18) simplifies. In this case, the maximization problemsin the last constraint group of (20) all evaluate to zero, which implies that (cid:10) C (cid:98) s i − d, φ i (cid:11) + (cid:10) W (cid:98) x i − H (cid:98) s i − h, µ i (cid:11) ≤ r i reduces to r i ≥
0, while the decision variables φ i and µ i as well as the constraints (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) C (cid:124) φ i − H (cid:124) µ i W (cid:124) µ i (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∗ ≤ λ can be omitted from (18) for all i ≤ N .For stress test experiments it is often desirable to know the extremal distribution that achieves the worst-case risk in (13). The following theorem shows that this extremal distribution can be constructed systemati-cally for any fixed θ ∈ Θ by solving a finite convex optimization problem akin to (18).
Theorem 5.3 (Worst-case distribution for linear hypotheses) . Under the assumptions of Theorem 5.2, theworst-case risk in (18) corresponding to a fixed θ ∈ Θ coincides with the optimal value of a finite convexprogram, i.e. , sup Q ∈ B pε ( (cid:98) P N ) CVaR Q α ( (cid:96) θ ) = max 1 αN N (cid:88) i =1 π i (cid:96) θ (cid:18) p i π i , q i π i (cid:19) s.t. π ij ∈ R + , p ij ∈ R m , q ij ∈ R n ∀ i ≤ N, j ≤ p ij π ij ∈ S , q ij π ij ∈ X ( p ij π ij ) ∀ i ≤ N, j ≤ π i + π i = 1 , N N (cid:88) i =1 π i = α ∀ i ≤ N N N (cid:88) i =1 2 (cid:88) j =1 π ij (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:32) p ij π ij − (cid:98) s iq ij π ij − (cid:98) x i (cid:33)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ ε. (21) For any optimal solution { π (cid:63)ij , p (cid:63)ij , q (cid:63)ij } of this convex program, the discrete distribution Q (cid:63) := 1 N N (cid:88) i =1 2 (cid:88) j =1 π (cid:63)ij δ ξ (cid:63)ij with ξ (cid:63)ij := (cid:32) p (cid:63)ij π (cid:63)ij , q (cid:63)ij π (cid:63)ij (cid:33) (cid:124) belongs to the Wassertein ball B ε ( (cid:98) P N ) and attains the supremum on the left hand side of (21) .Proof. As the loss function (17) is proper and jointly concave in x and s , we can use a similar reasoning asin [30, Theorem 4.5] to show that the convex program on the right hand side of (21) coincides with the strongdual of (18) for any fixed θ ∈ Θ. If this convex program is solvable and { π (cid:63)ij , p (cid:63)ij , q (cid:63)ij } is a maximizer, then Q (cid:63) ∈ B ε ( (cid:98) P N ) due to [30, Corollary 4.7]. It remains to be shown that CVaR Q (cid:63) α ( (cid:96) θ ) is no smaller than (18). Indeed, by the definition of CVaR we haveCVaR Q (cid:63) α ( (cid:96) θ ) = inf τ ∈ R τ + 1 αN N (cid:88) i =1 2 (cid:88) j =1 π (cid:63)ij max (cid:40) (cid:96) θ (cid:32) p (cid:63)ij π (cid:63)ij , q (cid:63)ij π (cid:63)ij (cid:33) − τ, (cid:41) = sup ≤ ν ij ≤ π (cid:63)ij αN N (cid:88) i =1 2 (cid:88) j =1 ν ij (cid:96) θ (cid:32) p (cid:63)ij π (cid:63)ij , q (cid:63)ij π (cid:63)ij (cid:33) : α = 1 N N (cid:88) i =1 2 (cid:88) j =1 ν ij ≥ αN N (cid:88) i =1 π (cid:63)i (cid:96) θ (cid:18) p (cid:63)i π (cid:63)i , q (cid:63)i π (cid:63)i (cid:19) = sup Q ∈ B pε ( (cid:98) P N ) CVaR Q α ( (cid:96) θ ) . Here, the second equality follows from strong linear programming duality, while the inequality follows fromthe feasibility of the solution ν i = π (cid:63)i and ν i = 0 for i ≤ N . (cid:3) We close this section by generalizing Theorem 5.2 to the bounded rationality loss (10). The proof largelyparallels that of Theorem 5.2 and is therefore omitted for brevity.
Corollary 5.4 (Linear hypotheses and bounded rationality loss). Assume that F represents the class of linear hypotheses with a search space of the form (16) and that Assumption 5.1 holds. If the observer uses the bounded rationality loss (10) and measures risk using the CVaR at level α ∈ (0, 1], then the inverse optimization problem (13) over the 1-Wasserstein ball is equivalent to a variant of the conic program (18) with an additional decision variable τ ∈ R₊ and where the first constraint group is replaced with

    ⟨C ŝ_i − d, φ_i⟩ + ⟨W x̂_i − H ŝ_i − h, μ_i + γ_i⟩ ≤ r_i + τ + δ   ∀ i ≤ N.
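Before turning to quadratic hypotheses, it is instructive to spell out the simplest instance of the above reformulations: in the noise-free case ε = 0 the Wasserstein ball collapses and (13) reduces to the empirical CVaR minimization (8), and for polyhedral feasible sets the inner minimum in the suboptimality loss (17) can be replaced by its linear programming dual, which is tight whenever the agent's problem is solvable. The sketch below is one possible cvxpy implementation under these assumptions, with the search space (16b) modeled as the probability simplex; it is an illustration rather than the authors' code (which is available at the repository cited in the introduction).

```python
# A minimal sketch (assumptions: linear hypotheses; search space theta >= 0, sum(theta) = 1;
# polyhedral sets X(s) = {x : W x >= H s + h}; epsilon = 0, so (13) reduces to (8); the inner
# minimum in (17) is replaced by its LP dual, which is tight when the agent's problem is solvable).
import cvxpy as cp

def erm_cvar_suboptimality(S_hat, X_hat, W, H, h, alpha):
    """Minimize the empirical CVaR of the suboptimality loss over linear hypotheses."""
    N, n = X_hat.shape
    theta = cp.Variable(n)
    tau = cp.Variable()
    excess = []                                                    # the terms max{loss_i - tau, 0}
    constraints = [theta >= 0, cp.sum(theta) == 1]                 # search space (16b) as a simplex
    for i in range(N):
        gamma = cp.Variable(W.shape[0], nonneg=True)               # dual multipliers of W y >= H s_i + h
        constraints.append(gamma @ W == theta)                     # dual feasibility ties gamma to theta
        loss_i = theta @ X_hat[i] - gamma @ (H @ S_hat[i] + h)     # dual upper bound on the suboptimality loss
        excess.append(cp.pos(loss_i - tau))
    objective = tau + sum(excess) / (alpha * N)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return theta.value, objective.value
```

For ε > 0 one would additionally introduce the dual variables, the budget term ελ and the dual norm constraints appearing in (18).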
Optimization problems with quadratic objectives abound in control [5], statistics [22], finance [29] and many other application domains. Algorithms for inverse optimization that can learn quadratic objective functions from signal-response pairs are therefore of great practical interest. This motivates us to consider the class $\mathcal{F}$ of quadratic hypotheses of the form $F_\theta(s,x) := \langle x, Q_{xx}x\rangle + \langle x, Q_{xs}s\rangle + \langle q, x\rangle$, which are encoded by a parameter $\theta := (Q_{xx}, Q_{xs}, q)$. The corresponding search space should account for prior information on the agent's objective and should exclude $\theta = 0$. Examples of tractable search spaces are listed below.

Example. If $F$ is only known to be strongly convex in $x$, it is natural to set
$$\Theta := \Big\{ \theta = (Q_{xx}, Q_{xs}, q) \in \mathbb{R}^{n\times n}\times\mathbb{R}^{n\times m}\times\mathbb{R}^n : Q_{xx} \succeq I \Big\}. \tag{22a}$$
Note that the normalization $Q_{xx} \succeq I$ is non-restrictive because a positive scaling of the objective function does not alter the agent's preferences.

If $F$ is only known to be bilinear in $s$ and $x$, it is natural to set
$$\Theta := \Big\{ \theta = (Q_{xx}, Q_{xs}, q) \in \mathbb{R}^{n\times n}\times\mathbb{R}^{n\times m}\times\mathbb{R}^n : Q_{xx} \succeq 0,\ Q_{xs} = I \Big\}, \tag{22b}$$
where the normalization $Q_{xs} = I$ can always be enforced by redefining $s$ if necessary.

If $F$ is close to a nominal objective function $\langle x, Q^0_{xx}x\rangle + \langle x, Q^0_{xs}s\rangle + \langle q^0, x\rangle$, then we may set
$$\Theta := \Big\{ \theta = (Q_{xx}, Q_{xs}, q) \in \mathbb{R}^{n\times n}\times\mathbb{R}^{n\times m}\times\mathbb{R}^n : Q_{xx} \succeq 0,\ \|\theta - \theta_0\| \le \Gamma \Big\}, \tag{22c}$$
where $\|\cdot\|$ denotes a generic norm, and $\Gamma$ captures the uncertainty of the nominal model $\theta_0 = (Q^0_{xx}, Q^0_{xs}, q^0)$.

When focusing on quadratic candidate objective functions, the suboptimality loss (4b) reduces to
$$\ell_\theta(s,x) = \langle x, Q_{xx}x + Q_{xs}s + q\rangle - \min_{y\in\mathbb{X}(s)} \langle y, Q_{xx}y + Q_{xs}s + q\rangle = \max_{y\in\mathbb{X}(s)} \langle x, Q_{xx}x + Q_{xs}s + q\rangle - \langle y, Q_{xx}y + Q_{xs}s + q\rangle. \tag{23}$$
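Evaluating the suboptimality loss (23) for a fixed quadratic hypothesis amounts to solving the agent's forward problem, which is a convex QP whenever $Q_{xx} \succeq 0$. A minimal sketch, assuming for concreteness a polyhedral feasible set $\mathbb{X}(s) = \{y : Wy \ge Hs + h\}$; all names are ours.

```python
import cvxpy as cp
import numpy as np

def suboptimality_loss(Q_xx, Q_xs, q, s, x, W, H, h):
    """Suboptimality loss (23): <x, Q_xx x + Q_xs s + q> minus the optimal value
    of the forward problem min_{y in X(s)} <y, Q_xx y + Q_xs s + q>."""
    n = Q_xx.shape[0]
    lin = Q_xs @ s + q                              # linear part of the objective
    y = cp.Variable(n)
    forward = cp.Problem(cp.Minimize(cp.quad_form(y, Q_xx) + lin @ y),
                         [W @ y >= H @ s + h])
    forward.solve()
    return float(x @ Q_xx @ x + lin @ x - forward.value)
```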
As in Section 5, we suppose that Assumption 5.1 holds. In this setting the agent's decision problem (1) constitutes a conic program and is therefore tractable for common choices of the cones $\mathcal{C}$ and $\mathcal{K}$. In contrast, the inverse optimization problem (13) is hard. In fact, it is already hard to evaluate the objective function of (13) for a fixed $\theta$. As we work with quadratic objectives, throughout this section we use the 2-norm on the signal-response space and the 2-Wasserstein metric to measure distances of distributions.

Theorem 6.1 (Intractability of (13) for quadratic hypotheses). Assume that $\mathcal{F}$ represents the class of quadratic hypotheses with search space (22a) and that Assumption 5.1 holds. If the observer uses the suboptimality loss (4b) and measures risk using the CVaR at level $\alpha \in (0, 1]$, then evaluating the objective function of (13) for a fixed $\theta \in \Theta$ is NP-hard even if $N = 1$ (there is only one data point), $\alpha = 1$ (the observer is risk-neutral), $\mathbb{S}$ is a singleton, $\mathbb{X}(s)$ is a polytope independent of $s$, and $Q_{xs} = 0$.

Proof. The proof relies on a reduction from the NP-hard quadratic maximization problem [31].
Quadratic Maximization
Instance.
A positive definite matrix $Q = Q^\top \succeq I$.

Goal.
Evaluate $\max_{\|x\|_\infty \le 1} \langle x, Qx\rangle$.

Given an input $Q \succeq I$ to the quadratic maximization problem, we construct an instance of the inverse optimization problem (13) with $N = 1$, $\alpha = 1$, and Wasserstein radius $\varepsilon = \sqrt{n}$, where
$$\widehat{s} := 0,\quad \widehat{x} := 0,\quad Q_{xx} := Q,\quad Q_{xs} := 0,\quad q := 0,\quad \mathbb{S} := \{0\},\quad \mathbb{X}(s) := \{x\in\mathbb{R}^n : \|x\|_\infty \le 1\}.$$
Under this parametrization, the objective function of (13) reduces to
$$\begin{aligned}
\sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N)} \rho_{\mathbb{Q}}(\ell_\theta)
&= \sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}^{\mathbb{Q}}\Big[\max_{y\in\mathbb{X}(s)} \langle x, Q_{xx}x + Q_{xs}s + q\rangle - \langle y, Q_{xx}y + Q_{xs}s + q\rangle\Big]\\
&= \sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}^{\mathbb{Q}}\Big[\max_{y\in\mathbb{X}(s)} \langle x, Qx\rangle - \langle y, Qy\rangle\Big]\\
&\le \sup_{s\in\mathbb{S},\,x\in\mathbb{X}(s)}\ \max_{y\in\mathbb{X}(s)} \langle x, Qx\rangle - \langle y, Qy\rangle
= \sup_{\|x\|_\infty\le 1} \langle x, Qx\rangle,
\end{aligned}$$
where the inequality in the third line follows from the inclusion $\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N) \subseteq \mathcal{M}(\Xi)$, while the last equality holds because the innermost maximum is attained at $y = 0$. As $(s,x) \in \Xi$ if and only if $s = 0$ and $\|x\|_\infty \le 1$, any $(s,x) \in \Xi$ satisfies $\|x\| \le \sqrt{n}$ and $\|(s,x)\| \le \sqrt{n} = \varepsilon$. Moreover, as the empirical distribution $\widehat{\mathbb{P}}_N$ coincides with the Dirac point measure at 0, the Wasserstein ball $\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N)$ thus contains all distributions supported on $\Xi$, implying that the inequality in the above expression is in fact an equality. Hence, evaluating the objective function of (13) is tantamount to solving the NP-hard quadratic maximization problem. This observation completes the proof. $\square$

Corollary 6.2 (Intractability of (13) for bilinear hypotheses). If all assumptions of Theorem 6.1 hold but $\mathcal{F}$ denotes the class of bilinear hypotheses with search space (22b), then evaluating the objective of (13) for a fixed $\theta \in \Theta$ is NP-hard even if $N = 1$, $\alpha = 1$, $\mathbb{S}$ is a singleton, $\mathbb{X}(s)$ is a polytope independent of $s$ and $Q_{xx} = 0$.

Proof. The proof is similar to that of Theorem 6.1 and omitted for brevity. $\square$
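The hardness in the reduction above stems from maximizing a convex quadratic over the box: the maximum is attained at one of the $2^n$ vertices, but no polynomial-time scheme is known to locate it. A purely illustrative brute-force check for small $n$ (names are ours):

```python
import itertools
import numpy as np

def box_quadratic_max(Q):
    """Brute-force max_{||x||_inf <= 1} <x, Qx> for a small PSD matrix Q.
    Since x -> <x, Qx> is convex, the maximum is attained at a vertex of the box,
    i.e. at some x in {-1, +1}^n; this enumeration is exponential in n."""
    n = Q.shape[0]
    best = -np.inf
    for v in itertools.product([-1.0, 1.0], repeat=n):
        x = np.array(v)
        best = max(best, float(x @ Q @ x))
    return best
```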
Corollary 6.2 asserts that the inverse optimization problem (13) is intractable even if we focus on linear candidate objectives that depend on the exogenous signal $s$. This finding contrasts with the tractability result of Theorem 5.2 for candidate objectives independent of $s$. The intractability results portrayed in Theorem 6.1 and Corollary 6.2 motivate us to devise a safe conic approximation for the inverse optimization problem (13) with quadratic candidate objective functions.

Theorem 6.3 (Quadratic hypotheses and suboptimality loss). Assume that $\mathcal{F}$ represents the class of quadratic hypotheses with a search space of the form (22) and that Assumption 5.1 holds. If the observer uses the suboptimality loss (4b) and measures risk using the CVaR at level $\alpha \in (0, 1]$, then the following conic program provides a safe approximation for the distributionally robust inverse optimization problem (13) over the 2-Wasserstein ball:
$$\begin{array}{ll}
\text{minimize} & \tau + \dfrac{1}{\alpha}\left(\varepsilon^2\lambda + \dfrac{1}{N}\displaystyle\sum_{i=1}^N r_i\right)\\[1ex]
\text{subject to} & \theta = (Q_{xx}, Q_{xs}, q) \in \Theta,\ \lambda\in\mathbb{R}_+,\ \tau, r_i, \rho_i, \underline{\rho}_i \in\mathbb{R} \quad \forall i\le N\\
& \phi_i, \underline{\phi}_i \in\mathcal{C}^*,\ \mu_i, \underline{\mu}_i, \gamma_i \in\mathcal{K}^*,\ \chi_i, \underline{\chi}_i \in\mathbb{R}^m,\ \zeta_i, \eta_i, \underline{\zeta}_i \in\mathbb{R}^n \quad \forall i\le N\\
& \chi_i = \tfrac{1}{2}\big(-C^\top\phi_i + H^\top(\mu_i+\gamma_i) - 2\lambda\widehat{s}_i\big) \quad \forall i\le N\\
& \zeta_i = \tfrac{1}{2}\big(-q - W^\top\mu_i - 2\lambda\widehat{x}_i\big),\quad \eta_i = \tfrac{1}{2}\big(q - W^\top\gamma_i\big) \quad \forall i\le N\\
& \rho_i = \tau + r_i + \lambda\big(\langle\widehat{x}_i,\widehat{x}_i\rangle + \langle\widehat{s}_i,\widehat{s}_i\rangle\big) + \langle d,\phi_i\rangle + \langle h,\mu_i+\gamma_i\rangle \quad \forall i\le N\\
& \underline{\chi}_i = \tfrac{1}{2}\big(-C^\top\underline{\phi}_i + H^\top\underline{\mu}_i - 2\lambda\widehat{s}_i\big) \quad \forall i\le N\\
& \underline{\zeta}_i = \tfrac{1}{2}\big(-W^\top\underline{\mu}_i - 2\lambda\widehat{x}_i\big) \quad \forall i\le N\\
& \underline{\rho}_i = r_i + \lambda\big(\langle\widehat{x}_i,\widehat{x}_i\rangle + \langle\widehat{s}_i,\widehat{s}_i\rangle\big) + \langle d,\underline{\phi}_i\rangle + \langle h,\underline{\mu}_i\rangle \quad \forall i\le N\\[1ex]
& \begin{pmatrix} \lambda I & -\tfrac{1}{2}Q_{xs}^\top & \tfrac{1}{2}Q_{xs}^\top & \chi_i\\ -\tfrac{1}{2}Q_{xs} & \lambda I - Q_{xx} & 0 & \zeta_i\\ \tfrac{1}{2}Q_{xs} & 0 & Q_{xx} & \eta_i\\ \chi_i^\top & \zeta_i^\top & \eta_i^\top & \rho_i \end{pmatrix} \succeq 0, \quad
\begin{pmatrix} \lambda I & 0 & \underline{\chi}_i\\ 0 & \lambda I & \underline{\zeta}_i\\ \underline{\chi}_i^\top & \underline{\zeta}_i^\top & \underline{\rho}_i \end{pmatrix} \succeq 0 \quad \forall i\le N.
\end{array} \tag{24}$$
Note that Theorem 6.3 remains valid if $(\widehat{s}_i, \widehat{x}_i) \notin \Xi$ for some $i \le N$.
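To give a flavour of how semidefinite constraints of this kind can be passed to an off-the-shelf solver, the following schematic cvxpy sketch assembles the two matrix inequalities for a single sample. It mirrors the block structure displayed above (including our reconstruction of the one-half factors) and is not meant as a drop-in implementation of (24); all names are ours.

```python
import cvxpy as cp
import numpy as np

def lmi_blocks_for_sample(Q_xx, Q_xs, lam, chi, zeta, eta, rho, chi_u, zeta_u, rho_u, m, n):
    """Return the two PSD constraints of a program like (24) for one sample,
    given the affine auxiliary quantities (chi, zeta, eta, rho) and
    (chi_u, zeta_u, rho_u) already defined as cvxpy expressions."""
    Im, In = np.eye(m), np.eye(n)
    M1 = cp.bmat([
        [lam * Im,       -0.5 * Q_xs.T,     0.5 * Q_xs.T,     cp.reshape(chi, (m, 1))],
        [-0.5 * Q_xs,     lam * In - Q_xx,  np.zeros((n, n)), cp.reshape(zeta, (n, 1))],
        [0.5 * Q_xs,      np.zeros((n, n)), Q_xx,             cp.reshape(eta, (n, 1))],
        [cp.reshape(chi, (1, m)), cp.reshape(zeta, (1, n)),
         cp.reshape(eta, (1, n)), cp.reshape(rho, (1, 1))],
    ])
    M2 = cp.bmat([
        [lam * Im,         np.zeros((m, n)), cp.reshape(chi_u, (m, 1))],
        [np.zeros((n, m)), lam * In,         cp.reshape(zeta_u, (n, 1))],
        [cp.reshape(chi_u, (1, m)), cp.reshape(zeta_u, (1, n)), cp.reshape(rho_u, (1, 1))],
    ])
    return [M1 >> 0, M2 >> 0]   # both blocks are arranged symmetrically by construction
```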
Proof of Theorem 6.3. As in the proof of Theorem 5.2 one can show that the objective function of the inverse optimization problem (13) coincides with a variant of (19) that involves the 2-Wasserstein ball. In the remainder, we derive a safe conic approximation for the subordinate worst-case expectation problem
$$\sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}^{\mathbb{Q}}\big[\max\{\ell_\theta(s,x)-\tau,\ 0\}\big]. \tag{25}$$
Duality arguments borrowed from [30, Theorem 4.2] imply that the above infinite-dimensional linear program admits a strong dual of the form
$$\begin{array}{cl}
\inf\limits_{\lambda\ge 0,\, r_i} & \varepsilon^2\lambda + \dfrac{1}{N}\displaystyle\sum_{i=1}^N r_i\\[1ex]
\text{s.t.} & \sup\limits_{s\in\mathbb{S},\ x,y\in\mathbb{X}(s)} \langle x, Q_{xx}x + Q_{xs}s + q\rangle - \langle y, Q_{xx}y + Q_{xs}s + q\rangle - \tau - \lambda\|(s,x)-(\widehat{s}_i,\widehat{x}_i)\|^2 \le r_i \quad \forall i\le N\\[1ex]
& \sup\limits_{s\in\mathbb{S},\ x\in\mathbb{X}(s)} -\lambda\|(s,x)-(\widehat{s}_i,\widehat{x}_i)\|^2 \le r_i \quad \forall i\le N.
\end{array} \tag{26}$$
By using the definitions of $\mathbb{S}$ and $\mathbb{X}(s)$ put forth in Assumption 5.1, the $i$-th member of the first constraint group in (26) is satisfied if and only if the optimal value of the maximization problem
$$\begin{array}{cl}
\sup\limits_{s,x,y} & \langle x, Q_{xx}x + Q_{xs}s + q\rangle - \langle y, Q_{xx}y + Q_{xs}s + q\rangle - \tau - \lambda\|(s,x)-(\widehat{s}_i,\widehat{x}_i)\|^2\\
\text{s.t.} & Cs \succeq_{\mathcal{C}} d,\quad Wx \succeq_{\mathcal{K}} Hs + h,\quad Wy \succeq_{\mathcal{K}} Hs + h
\end{array} \tag{27}$$
does not exceed $r_i$. The Lagrangian of the conic quadratic program (27) is defined as
$$\begin{aligned}
L(s,x,y;\phi_i,\mu_i,\gamma_i) := {}& \langle x, Q_{xx}x + Q_{xs}s + q\rangle - \langle y, Q_{xx}y + Q_{xs}s + q\rangle - \tau - \lambda\|(s,x)-(\widehat{s}_i,\widehat{x}_i)\|^2\\
& + \langle Cs - d,\ \phi_i\rangle + \langle Wx - Hs - h,\ \mu_i\rangle + \langle Wy - Hs - h,\ \gamma_i\rangle,
\end{aligned} \tag{28}$$
and therefore (27) can be expressed as a max-min problem of the form
$$\sup_{s,x,y}\ \inf_{\phi_i\in\mathcal{C}^*}\ \inf_{\mu_i,\gamma_i\in\mathcal{K}^*} L(s,x,y;\phi_i,\mu_i,\gamma_i) \ \le\ \inf_{\phi_i\in\mathcal{C}^*}\ \inf_{\mu_i,\gamma_i\in\mathcal{K}^*}\ \sup_{s,x,y} L(s,x,y;\phi_i,\mu_i,\gamma_i),$$
where the inequality follows from weak duality. As $\Xi$ contains a Slater point by virtue of Assumption 5.1, strong duality holds (meaning that the inequality collapses to an equality) if the Lagrangian is concave in $(s,x,y)$; see also Proposition 6.4 below. We conclude that the $i$-th member of the first constraint group in (26) holds if there exist $\phi_i\in\mathcal{C}^*$ and $\mu_i,\gamma_i\in\mathcal{K}^*$ with $\sup_{s,x,y} L(s,x,y;\phi_i,\mu_i,\gamma_i) \le r_i$. As the Lagrangian constitutes a quadratic function, this statement is satisfied if and only if there are $\phi_i\in\mathcal{C}^*$, $\mu_i,\gamma_i\in\mathcal{K}^*$, $\chi_i\in\mathbb{R}^m$, $\zeta_i,\eta_i\in\mathbb{R}^n$ and $\rho_i\in\mathbb{R}$ with
$$\begin{array}{l}
\chi_i = \tfrac{1}{2}\big(-C^\top\phi_i + H^\top(\mu_i+\gamma_i) - 2\lambda\widehat{s}_i\big)\\
\zeta_i = \tfrac{1}{2}\big(-q - W^\top\mu_i - 2\lambda\widehat{x}_i\big),\quad \eta_i = \tfrac{1}{2}\big(q - W^\top\gamma_i\big)\\
\rho_i = \tau + r_i + \lambda\big(\langle\widehat{x}_i,\widehat{x}_i\rangle + \langle\widehat{s}_i,\widehat{s}_i\rangle\big) + \langle d,\phi_i\rangle + \langle h,\mu_i+\gamma_i\rangle\\[1ex]
\begin{pmatrix} \lambda I & -\tfrac{1}{2}Q_{xs}^\top & \tfrac{1}{2}Q_{xs}^\top & \chi_i\\ -\tfrac{1}{2}Q_{xs} & \lambda I - Q_{xx} & 0 & \zeta_i\\ \tfrac{1}{2}Q_{xs} & 0 & Q_{xx} & \eta_i\\ \chi_i^\top & \zeta_i^\top & \eta_i^\top & \rho_i \end{pmatrix} \succeq 0.
\end{array}$$
Similarly, it can be shown that the $i$-th member of the second constraint group in (26) is satisfied if and only if there exist $\underline{\phi}_i\in\mathcal{C}^*$, $\underline{\mu}_i\in\mathcal{K}^*$, $\underline{\chi}_i\in\mathbb{R}^m$, $\underline{\zeta}_i\in\mathbb{R}^n$ and $\underline{\rho}_i\in\mathbb{R}$ such that
$$\begin{array}{l}
\underline{\chi}_i = \tfrac{1}{2}\big(-C^\top\underline{\phi}_i + H^\top\underline{\mu}_i - 2\lambda\widehat{s}_i\big)\\
\underline{\zeta}_i = \tfrac{1}{2}\big(-W^\top\underline{\mu}_i - 2\lambda\widehat{x}_i\big)\\
\underline{\rho}_i = r_i + \lambda\big(\langle\widehat{x}_i,\widehat{x}_i\rangle + \langle\widehat{s}_i,\widehat{s}_i\rangle\big) + \langle d,\underline{\phi}_i\rangle + \langle h,\underline{\mu}_i\rangle\\[1ex]
\begin{pmatrix} \lambda I & 0 & \underline{\chi}_i\\ 0 & \lambda I & \underline{\zeta}_i\\ \underline{\chi}_i^\top & \underline{\zeta}_i^\top & \underline{\rho}_i \end{pmatrix} \succeq 0.
\end{array} \tag{29}$$
Replacing the semi-infinite constraints in (26) with the respective semidefinite approximations shows that the worst-case expectation (25) is bounded above by the optimal value of the conic program (30), which minimizes $\varepsilon^2\lambda + \frac{1}{N}\sum_{i=1}^N r_i$ over all $\lambda\in\mathbb{R}_+$, $r_i,\rho_i,\underline{\rho}_i\in\mathbb{R}$, $\phi_i,\underline{\phi}_i\in\mathcal{C}^*$, $\mu_i,\underline{\mu}_i,\gamma_i\in\mathcal{K}^*$, $\chi_i,\underline{\chi}_i\in\mathbb{R}^m$ and $\zeta_i,\eta_i,\underline{\zeta}_i\in\mathbb{R}^n$ satisfying the same constraint system as in (24). The claim then follows by substituting (30) into a suitable worst-case CVaR formula akin to (19). $\square$

If $(\widehat{s}_i, \widehat{x}_i) \in \Xi$ for all $i \le N$, then the conic program (24) simplifies. In this case, the maximization problems in the last constraint group of (26) all evaluate to zero, which implies that the constraints (29) reduce to $r_i \ge$
0, while the decision variables $\underline{\phi}_i$, $\underline{\mu}_i$, $\underline{\chi}_i$, $\underline{\zeta}_i$ and $\underline{\rho}_i$ can be omitted from (24) for all $i \le N$.

In spite of the hardness results outlined in Theorem 6.1 and Corollary 6.2, the following proposition shows that the tractable approximation of Theorem 6.3 is sometimes exact.
Proposition 6.4 (Ex post exactness guarantee). If an optimal solution to the safe conic approximation (24) from Theorem 6.3 satisfies the strict inequality
$$\begin{pmatrix} \lambda I & -\tfrac{1}{2}Q_{xs}^\top & \tfrac{1}{2}Q_{xs}^\top\\ -\tfrac{1}{2}Q_{xs} & \lambda I - Q_{xx} & 0\\ \tfrac{1}{2}Q_{xs} & 0 & Q_{xx} \end{pmatrix} \succ 0, \tag{31}$$
then (24) is equivalent to the original distributionally robust inverse optimization problem (13).

Our computational experiments suggest that the ex post exactness condition (31) is often satisfied if the Wasserstein radius $\varepsilon$ is not too large.

Proof of Proposition 6.4.
From the proof of Theorem 6.3 we conclude that the optimal value of the distributionally robust inverse optimization problem (13) can be represented as $\inf_{\theta,\lambda,\tau} \varphi(\theta,\lambda,\tau)$, where
$$\varphi(\theta,\lambda,\tau) := \lambda\varepsilon^2 + \frac{1}{N}\sum_{i=1}^N\ \sup_{s\in\mathbb{S},\ x,y\in\mathbb{X}(s)} \Big[\max\big\{\langle x, Q_{xx}x + Q_{xs}s + q\rangle - \langle y, Q_{xx}y + Q_{xs}s + q\rangle - \tau,\ 0\big\} - \lambda\|(s,x)-(\widehat{s}_i,\widehat{x}_i)\|^2\Big]$$
for $\theta \in \Theta$ and $\lambda \ge$
0, and $\varphi(\theta,\lambda,\tau) = \infty$ otherwise. By construction, $\varphi$ is jointly convex in $\theta = (Q_{xx}, Q_{xs}, q)$, $\lambda$ and $\tau$. Moreover, it is clear that the optimal value of the finite convex program (24) can be represented as $\inf_{\theta,\lambda,\tau} \widehat{\varphi}(\theta,\lambda,\tau)$, where $\widehat{\varphi}(\theta,\lambda,\tau)$ denotes the infimum of (24) when the decision variables $\theta$, $\lambda$ and $\tau$ are fixed. The proof of Theorem 6.3 further implies that $\widehat{\varphi}$ coincides with $\varphi$ whenever the Lagrangian (28) is concave in $(s,x,y)$, which is the case if and only if
$$\begin{pmatrix} -\lambda I & \tfrac{1}{2}Q_{xs}^\top & -\tfrac{1}{2}Q_{xs}^\top\\ \tfrac{1}{2}Q_{xs} & Q_{xx} - \lambda I & 0\\ -\tfrac{1}{2}Q_{xs} & 0 & -Q_{xx} \end{pmatrix} \preceq 0. \tag{32}$$
Note also that $\widehat{\varphi}(\theta,\lambda,\tau) = \infty$ whenever (32) is violated because (32) is implied by the constraints of (24). In summary, $\varphi$ and $\widehat{\varphi}$ are both convex functions satisfying
$$\widehat{\varphi}(\theta,\lambda,\tau) = \begin{cases} \varphi(\theta,\lambda,\tau) & \text{if (32) holds,}\\ +\infty & \text{otherwise.}\end{cases}$$
Thus, any minimizer of $\widehat{\varphi}$ satisfying the strict inequality (31) resides in the interior of the region where the convex functions $\widehat{\varphi}$ and $\varphi$ coincide and must therefore also minimize $\varphi$. $\square$
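The ex post check (31) is easy to carry out once an optimal $(Q_{xx}, Q_{xs}, \lambda)$ is available: it suffices to verify that the smallest eigenvalue of the indicated block matrix is strictly positive. A minimal sketch under our reconstruction of the block structure (names are ours):

```python
import numpy as np

def exactness_certificate(Q_xx, Q_xs, lam, tol=1e-9):
    """Check the strict matrix inequality (31) at an optimal solution of (24).
    Returns True if the block matrix is positive definite, in which case the
    safe approximation is exact for problem (13)."""
    n, m = Q_xx.shape[0], Q_xs.shape[1]
    M = np.block([
        [lam * np.eye(m), -0.5 * Q_xs.T,           0.5 * Q_xs.T],
        [-0.5 * Q_xs,      lam * np.eye(n) - Q_xx, np.zeros((n, n))],
        [0.5 * Q_xs,       np.zeros((n, n)),       Q_xx],
    ])
    return bool(np.linalg.eigvalsh((M + M.T) / 2).min() > tol)
```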
Finally, we generalize Theorem 6.3 to the bounded rationality loss (10). The proof largely parallels that of Theorem 6.3 and is therefore omitted for brevity.

Corollary 6.5 (Quadratic hypotheses and bounded rationality loss). Assume that $\mathcal{F}$ represents the class of quadratic hypotheses with a search space of the form (22) and that Assumption 5.1 holds. If the observer uses the bounded rationality loss (10) and measures risk using the CVaR at level $\alpha \in (0, 1]$, then the inverse optimization problem (13) over the 2-Wasserstein ball is conservatively approximated by a variant of the conic program (24) where $\tau \in \mathbb{R}_+$ and the defining equation of $\rho_i$ is replaced with
$$\rho_i = \tau + r_i + \delta + \lambda\big(\langle\widehat{x}_i,\widehat{x}_i\rangle + \langle\widehat{s}_i,\widehat{s}_i\rangle\big) + \langle d,\phi_i\rangle + \langle h,\mu_i+\gamma_i\rangle \quad \forall i \le N.$$

The distributionally robust inverse optimization problem (13) also admits a safe convex approximation when the search space consists of a class of convex hypotheses of the type considered in [27]. Due to space restrictions we relegate this discussion to Appendix A.
7. Numerical Experiments
We now compare the proposed distributionally robust approach to inverse optimization against the state-of-the-art techniques described in [6] and [10]. The first experiment aims to learn a linear hypothesis from imperfect training samples, where the imperfection is explained by measurement noise or the agent's bounded rationality. Similarly, the second experiment endeavors to learn a quadratic hypothesis from imperfect training samples, where the imperfection is explained by measurement noise or model uncertainty.

All experiments are run on an Intel Xeon CPU with 3.40GHz clock speed and 16GB of RAM. All linear, quadratic and second-order cone programs are solved with CPLEX 12.6, and all semidefinite programs are solved with MOSEK 8. In order to ensure that our experiments are reproducible, we make the underlying source code available at https://github.com/sorooshafiee/InverseOptimization.

7.1. Learning a Linear Hypothesis
The goal of this experiment is to learn a linear hypothesis from imperfect training samples where the measured responses represent feasible perturbations of the true optimal responses. The perturbations can be explained by measurement noise or by the agent's bounded rationality. We argue that different causes of the perturbations necessitate different inverse optimization models.
7.1.A. Consistent Noisy Measurements
Decision problem of the agent:
We assume that the agent's true objective function is $F(s,x) = \langle\theta^\star, x\rangle$ for some $\theta^\star$ in the vicinity of a nominal model $\theta_0$. The nominal model is sampled uniformly from $\Theta_0 := \{\theta\in\mathbb{R}^n : \|\theta\|_\infty \in [1,\,\cdot\,]\}$, while $\theta^\star$ is sampled uniformly from $\Theta := \{\theta\in\mathbb{R}^n : \|\theta-\theta_0\|_\infty \le \cdot\,\}$. This construction implies that $\theta^\star \neq 0$ almost surely. We also assume that the agent's feasible set is given by $\mathbb{X}(s) = \{x\in\mathbb{R}^n : \|x\|_\infty \le \cdot\,,\ Ax \ge s\}$, where the constraint matrix $A$ is sampled uniformly from the hypercube $[-\,\cdot\,,\,\cdot\,]^{m\times n}$. This feasible set can be brought to the standard form of Assumption 5.1 by setting $W = (I, -I, A^\top)^\top \in \mathbb{R}^{(2n+m)\times n}$, $H = (0, 0, I)^\top \in \mathbb{R}^{(2n+m)\times m}$, $h = (-\,\cdot\,, -\,\cdot\,, 0)^\top \in \mathbb{R}^{2n+m}$ and $\mathcal{K} = \mathbb{R}^{2n+m}_+$. For a fixed signal $s$, we denote the optimal value of the agent's true decision problem by $z^\star(s)$.
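A minimal sketch of this experimental setup; the sampling ranges and dimensions below are placeholders, as the exact constants are not reproduced here, and all names are ours.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 10, 10                                        # response and signal dimensions (example values)
theta_0 = rng.uniform(-1.0, 1.0, n)                  # nominal model (placeholder range)
theta_star = theta_0 + rng.uniform(-0.5, 0.5, n)     # true model near theta_0 (placeholder range)
A = rng.uniform(-1.0, 1.0, (m, n))                   # constraint matrix (placeholder range)

def agent_response(theta, s, box=1.0):
    """Forward problem (1): minimize <theta, x> over X(s) = {|x_i| <= box, A x >= s}."""
    x = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(theta @ x),
                      [cp.norm(x, "inf") <= box, A @ x >= s])
    prob.solve()
    return x.value, prob.value                       # optimal response and optimal value z*(s)
```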
Generation of training samples: A single signal is constructed as $s = Av$, where the auxiliary random variable $v$ follows the uniform distribution on $[-\,\cdot\,,\,\cdot\,]^n$. Thus, if $a_i$, $i \le m$, denote the rows of the constraint matrix $A$, then the support of $s$ can be expressed as $\mathbb{S} = \{s\in\mathbb{R}^m : |s_i| \le \|a_i\|\ \forall i \le m\}$. This support can be brought to the standard form of Assumption 5.1 by setting $C = (I, -I)^\top \in \mathbb{R}^{2m\times m}$, $d = (\|a_1\|, \dots, \|a_m\|, \|a_1\|, \dots, \|a_m\|)^\top \in \mathbb{R}^{2m}$ and $\mathcal{C} = \mathbb{R}^{2m}_+$. As $v \in \mathbb{X}(s)$ by construction, problem (1) is feasible for every $s \in \mathbb{S}$. We assume that the agent's response to $s$ is noisy in the sense that it constitutes a random $\delta$-suboptimal solution to (1) for $\delta = 1$. Specifically, we assume that $x$ is obtained as a solution of an auxiliary optimization problem
$$\min_{x\in\mathbb{X}(s)} \big\{\langle\theta_{\mathrm{rand}}, x\rangle : \langle\theta^\star, x\rangle \le z^\star(s) + \delta\big\},$$
which minimizes a linear cost over the set of all $\delta$-suboptimal solutions to (1). The gradient $\theta_{\mathrm{rand}}$ of the cost is sampled uniformly from $[-\,\cdot\,,\,\cdot\,]^n$. By construction, we have $x \in \mathbb{X}(s)$, which implies that $(s,x) \in \Xi$. Thus, the measurement noise is consistent with the known support of the exact signal-response pairs. The imperfect consistent training samples $(\widehat{s}_i, \widehat{x}_i)$, $i \le N$, are now generated independently using the above procedure.
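A sketch of this sampling step, continuing the setup of the previous snippet (the helper names and ranges are ours):

```python
def noisy_response(theta_star, s, delta=1.0, box=1.0):
    """Draw a consistent noisy measurement: a random delta-suboptimal solution of (1)."""
    _, z_star = agent_response(theta_star, s)              # true optimal value z*(s)
    theta_rand = rng.uniform(-1.0, 1.0, n)                  # random cost direction (placeholder range)
    x = cp.Variable(n)
    prob = cp.Problem(cp.Minimize(theta_rand @ x),
                      [cp.norm(x, "inf") <= box, A @ x >= s,
                       theta_star @ x <= z_star + delta])   # delta-suboptimality of the response
    prob.solve()
    return x.value

# one training sample: signal s = A v with v uniform (placeholder range), response as above
v = rng.uniform(-1.0, 1.0, n)
s_hat = A @ v
x_hat = noisy_response(theta_star, s_hat)
```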
Decision problem of the observer: The observer aims to identify a linear hypothesis $F_\theta(s,x) := \langle\theta, x\rangle$, $\theta \in \Theta$, that best predicts the agent's responses to new signals, where the search space $\Theta$ is set to the $\infty$-norm ball around the nominal model $\theta_0$ described above. We assume that the observer minimizes the suboptimality loss (4b) and uses the expected value to measure risk. Moreover, the observer solves the distributionally robust inverse optimization problem (13) over a 1-Wasserstein ball around the empirical distribution on the training samples, where the $\infty$-norm is used as the transportation cost on $\Xi$. Note that this problem can be reformulated as the tractable linear program (18) by virtue of Theorem 5.2.
Figure 1. Out-of-sample suboptimality and predictability risks in the presence of imperfect consistent training samples and perfect test samples. (a) Impact of the Wasserstein radius on the suboptimality and predictability risk; (b) suboptimality learning curve; (c) predictability learning curve.

Out-of-sample risk:
To assess the quality of an estimator $\widehat{\theta}_N$ obtained from (18), we evaluate its out-of-sample risk $\mathbb{E}^{\mathbb{P}_{\mathrm{out}}}(\ell_{\widehat{\theta}_N})$ both with respect to the predictability loss (4a) and the suboptimality loss (4b), where $\mathbb{P}_{\mathrm{out}}$ represents the distribution of a single test sample $(s,x)$ independent of the training samples, and where $x$ is an exact (non-noisy) response to $s$ in (1). In practice, the out-of-sample risk cannot be evaluated analytically but only numerically by using 1,000 independent test samples from $\mathbb{P}_{\mathrm{out}}$. By relying on non-noisy test samples, we assess how well the estimator $\widehat{\theta}_N$ can predict the agent's exact optimal responses.

Results:
All numerical results are averaged across 100 independent problem instances $\{\theta_0, \theta^\star, A\}$. The first experiment involves $m = 50$ signal variables, $n = 50$ decision variables and $N = 10$ training samples. Figure 1(a) shows how the out-of-sample risk of the optimal estimator $\widehat{\theta}_N(\varepsilon)$ obtained from (18) changes with the radius $\varepsilon$ of the underlying Wasserstein ball. This experiment suggests that both the predictability and suboptimality risks can be significantly reduced by using a distributionally robust inverse optimization model with a judiciously calibrated ambiguity set. Unfortunately, Figure 1(a), from which the optimal Wasserstein radii could be read off easily, is not available in the training phase as its construction requires large amounts of test samples. Instead, the Wasserstein radii offering the lowest out-of-sample risk must also be estimated from the training data. To this end, we use the following $k$-fold cross validation algorithm. Partition $\widehat{\xi}_1, \dots, \widehat{\xi}_N$ into $k$ folds, and repeat the following procedure for each fold $i = 1, \dots, k$. Use the $i$-th fold as a validation dataset and merge the remaining $k-1$ folds into the $i$-th training dataset of size $N_i$. Using only the $i$-th training dataset, solve (18) for a large but finite number of candidate radii $\varepsilon \in \mathcal{E}$ to obtain an estimator $\widehat{\theta}_{N_i}(\varepsilon)$. Use the $i$-th validation dataset to estimate the out-of-sample risk of $\widehat{\theta}_{N_i}(\varepsilon)$ for each $\varepsilon \in \mathcal{E}$, and set $\widehat{\varepsilon}^{\,i}_N$ to any $\varepsilon \in \mathcal{E}$ that minimizes this quantity. After all folds have been processed, set $\widehat{\varepsilon}_N$ to the average of the $\widehat{\varepsilon}^{\,i}_N$ for $i = 1, \dots, k$, and re-solve (18) with $\varepsilon = \widehat{\varepsilon}_N$ using all $N$ samples. Report the optimal solution $\widehat{\theta}_N$ and the optimal value $\widehat{J}_N$ of (18) as the recommended estimator and its corresponding certificate, respectively. Throughout all experiments we set $k = \min\{\,\cdot\,, N\}$ and $\mathcal{E} = \{\varepsilon = b\cdot 10^{c} : b \in \{\,\cdot\,,\,\cdot\,\},\ c \in \{-\,\cdot\,, \dots, -\,\cdot\,\}\}$. For brevity, we henceforth refer to the above cross-validation scheme as the distributionally robust optimization (DRO) approach.
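A compact sketch of this cross-validation scheme, assuming a routine `solve_dro(samples, eps)` that returns an estimator from a program like (18) and a routine `risk(theta, samples)` that evaluates the chosen empirical loss; both names are ours.

```python
import numpy as np

def cross_validate_radius(samples, radii, k, solve_dro, risk):
    """k-fold cross validation of the Wasserstein radius, as described above."""
    rng = np.random.default_rng(0)
    idx = rng.permutation(len(samples))
    folds = np.array_split(idx, k)
    chosen = []
    for i in range(k):
        val = [samples[j] for j in folds[i]]                        # i-th validation fold
        train = [samples[j] for f in folds[:i] + folds[i+1:] for j in f]
        scores = [risk(solve_dro(train, eps), val) for eps in radii]
        chosen.append(radii[int(np.argmin(scores))])                # best radius on this fold
    eps_hat = float(np.mean(chosen))                                # average over all folds
    return solve_dro(samples, eps_hat), eps_hat                     # re-solve on all samples
```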
In the following, we compare the resulting DRO estimator against two state-of-the-art estimators from the literature. The first estimator is obtained from the variational inequality (VI) approach proposed in [10], which minimizes the first-order loss (4c) averaged across all training samples. By [10, Theorem 3], the resulting inverse optimization problem is equivalent to the tractable linear program
$$\begin{array}{cl}
\text{minimize} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N |r_i|\\
\text{subject to} & \langle W\widehat{x}_i - H\widehat{s}_i - h,\ \gamma_i\rangle \le r_i \quad \forall i\le N\\
& W^\top\gamma_i = \theta \quad \forall i\le N\\
& \gamma_i \ge 0 \quad \forall i\le N\\
& \theta \in \Theta.
\end{array}$$
Note that as all training samples are consistent, that is, $(\widehat{s}_i, \widehat{x}_i) \in \Xi$ for $i = 1, \dots, N$, one can show that all residuals $r_i$ are automatically non-negative, and the above linear program can be viewed as a special instance of (13) that minimizes the first-order loss, where the risk measure is set to the expected value and the Wasserstein radius is set to zero. If there were inconsistent training samples that fall outside of $\Xi$, on the other hand, the absolute values of the residuals would be needed to prevent unboundedness.
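The VI linear program displayed above is straightforward to state in an off-the-shelf modelling layer. A sketch assuming the standard-form data $W$, $H$, $h$ and an $\infty$-norm-ball search space for $\theta$; all names are ours.

```python
import cvxpy as cp

def vi_estimator(S_hat, X_hat, W, H, h, theta_0, radius):
    """Variational-inequality estimator of [10]: minimize the average first-order loss."""
    N, n = X_hat.shape
    k = W.shape[0]
    theta = cp.Variable(n)
    r = cp.Variable(N)
    gammas = [cp.Variable(k, nonneg=True) for _ in range(N)]
    cons = [cp.norm(theta - theta_0, "inf") <= radius]              # search space Theta
    for i in range(N):
        resid = W @ X_hat[i] - H @ S_hat[i] - h
        cons += [resid @ gammas[i] <= r[i],                         # first-order loss bound
                 W.T @ gammas[i] == theta]                          # stationarity multiplier
    prob = cp.Problem(cp.Minimize(cp.sum(cp.abs(r)) / N), cons)
    prob.solve()
    return theta.value
```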
The second estimator is obtained from the bilevel programming (BP) approach proposed in [6], which minimizes the predictability loss (4a) averaged across all training samples. As shown in [6, Section 2.2], the resulting inverse optimization problem is equivalent to the optimistic bilevel program
$$\begin{array}{cl}
\text{minimize} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N \|\widehat{x}_i - y_i\|\\
\text{subject to} & y_i \in \arg\min_z \big\{\langle z, \theta\rangle : Wz \ge H\widehat{s}_i + h\big\} \quad \forall i\le N\\
& \theta \in \Theta.
\end{array}$$
Note that this bilevel program can be viewed as a special instance of (13) that minimizes the predictability loss, where the risk measure is set to the expected value and the Wasserstein radius is set to zero. The above bilevel program can be reformulated as a mixed integer quadratic program by replacing the 'arg min'-constraint with the Karush-Kuhn-Tucker optimality conditions of the agent's linear program. Indeed, note that the resulting complementary slackness conditions can be linearized by introducing binary variables that identify the binding constraints. However, this approach requires big-$M$ constants to bound the optimal dual variables. As valid big-$M$ constants are difficult to obtain in general and as overly conservative big-$M$ constants lead to numerical instability, we use here the YALMIP interface for bilevel programming, which calls a dedicated branch and bound algorithm that branches directly on the complementarity slackness conditions [28]. Throughout our experiments we limit all branch and bound calculations to 2,000 iterations.

Figures 1(b) and 1(c) visualize the suboptimality and predictability learning curves, respectively, which capture how the out-of-sample risk of the different estimators changes with the number of training samples. As optimistic bilevel programs are NP-hard even when all objective and constraint functions are linear [8], this experiment focuses on problem instances with $n = m = 10$. Even for these moderate problem dimensions, however, the inverse optimization problems associated with the BP approach fail to find a feasible solution for more than 10 training samples. In contrast, all other approaches lead to tractable linear programs. Figure 1(b) shows that the DRO approach dominates the VI and BP approaches uniformly across all sample sizes in terms of out-of-sample suboptimality risk. This is reassuring as the DRO approach actually minimizes the suboptimality risk. However, Figure 1(c) suggests that the DRO approach wins even in terms of out-of-sample predictability risk. This is perhaps surprising because, unlike the BP approach, the DRO approach only minimizes an approximation of the predictability loss. We conclude that injecting distributional robustness may be more beneficial for out-of-sample performance than using the correct loss function.

Tables 1 and 2 report the out-of-sample suboptimality and predictability risks, respectively, for the DRO and VI estimators based on $N = 10$ training samples and for different signal and response dimensions. The BP approach is excluded from this comparison due to its intractability. We observe that the DRO estimator always attains the lowest suboptimality risk and often the lowest predictability risk. Whenever the VI estimator wins, the predictability risks of the VI and DRO estimators are in fact almost identical.

Table 1.
Out-of-sample suboptimality risk in the presence of noisy consistent measurements mn Methods 10 20 30 40 5010 VI 1 . × − . × − . × − . × − . × − DRO 1 . × − . × − . × − . × − . × −
20 VI 3 . × − . × − . × − . × − . × − DRO 2 . × − . × − . × − . × − . × −
30 VI 3 . × − . × − . × − . × − . × − DRO 3 . × − . × − . × − . × − . × −
40 VI 4 . × − . × − . × − . × − . × − DRO 3 . × − . × − . × − . × − . × −
50 VI 5 . × − . × − . × − . × − . × − DRO 4 . × − . × − . × − . × − . × − Table 2.
Out-of-sample predictability risk in the presence of consistent noisy measurements mn Methods 10 20 30 40 5010 VI 6 . × − . × − . × − . × − . × − DRO 5 . × − . × − . × − . × − . × −
20 VI 1 . × . × − . × − . × − . × − DRO 1 . × . × − . × − . × − . × −
30 VI 1 . × . × . × . × − . × − DRO 1 . × . × . × . × − . × −
40 VI 1 . × . × . × . × . × DRO 1 . × . × . × . × . ×
50 VI 1 . × . × . × . × . × DRO 1 . × . × . × . × . ×
Figure 2. Out-of-sample suboptimality and predictability risks in the presence of imperfect consistent training and test samples. (a) Impact of the Wasserstein radius on the δ-suboptimality and predictability risk; (b) δ-suboptimality learning curve; (c) predictability learning curve.
Table 3.
Out-of-sample suboptimality risk in the presence of bounded rationality mn Methods 10 20 30 40 5010 VI 4 . × − . × − . × − . × − . × − DRO 2 . × − . × − . × − . × − . × −
20 VI 7 . × − . × − . × − . × − . × − DRO 5 . × − . × − . × − . × − . × −
30 VI 1 . × − . × − . × − . × − . × − DRO 7 . × − . × − . × − . × − . × −
40 VI 1 . × − . × − . × − . × − . × − DRO 9 . × − . × − . × − . × − . × −
50 VI 1 . × − . × − . × − . × − . × − DRO 1 . × − . × − . × − . × − . × −

7.1.B. Bounded Rationality
Consider the exact same setting as in Section 7.1.A but assume now that the agent suffers from bounded rationality. Specifically, assume that the agent selects random $\delta$-suboptimal decisions that are perfectly measured by the observer. This means that the training samples are generated as in Section 7.1.A, but the imperfection of the training responses is now explained by the agent's bounded rationality instead of the observer's noisy measurements. At this point one may wonder why a new interpretation of the imperfections should impact the observer's inverse optimization problem. In the following we will assume, however, that the observer is aware of the agent's bounded rationality, knows the value of $\delta$ and aims to predict the agent's actual suboptimal decisions. Therefore, the DRO approach to inverse optimization is subject to two changes:

• The observer minimizes the bounded rationality loss (10) instead of the suboptimality loss (4b).

• The test samples are generated in the same way as the training samples. Thus, the test responses no longer constitute perfect minimizers of (1) but represent random $\delta$-suboptimal solutions.

Recall that the bounded rationality loss favors hypotheses that correctly predict $\delta$-suboptimal responses. We also highlight that the observer's inverse optimization problem with bounded rationality loss can be reformulated as a tractable linear program by virtue of Corollary 5.4. Moreover, by using imperfect test samples, the out-of-sample risk is now measured under the distribution of an imperfect signal-response pair, which is the correct performance criterion given that the observer aims to predict imperfect responses.

In analogy to Section 7.1.A, the impact of the Wasserstein radius on the out-of-sample suboptimality and predictability risk is shown in Figure 2(a). The suboptimality and predictability learning curves of the different estimators are visualized in Figures 2(b) and 2(c), respectively. Here, the DRO estimator is defined in terms of the bounded rationality loss, while the VI and BP estimators are computed as in Section 7.1.A and are thus not corrected for the agent's bounded rationality. The impact of the signal and response dimensions on the out-of-sample suboptimality and predictability risk is reported in Tables 3 and 4, respectively. Unless otherwise stated, all experiments are parameterized exactly as in Section 7.1.A. Here, the DRO estimator systematically attains the lowest suboptimality risk, while the VI estimator almost always wins in terms of predictability risk, even though only by a small margin.

7.2. Learning a Quadratic Hypothesis
A fundamental problem in marketing is to understand the purchasing decisions of consumers, which is an essential prerequisite for estimating demand functions. In this section we study the inverse optimization problem of a marketing manager (the observer) aiming to learn the quadratic utility function that best explains the purchasing decisions of a consumer (the agent). This problem setup is inspired by [27].
Table 4.
Out-of-sample predictability risk in the presence of bounded rationality mn Methods 10 20 30 40 5010 VI 8 . × − . × − . × − . × − . × − DRO 7 . × − . × − . × − . × − . × −
20 VI 1 . × . × . × − . × − . × − DRO 1 . × . × . × − . × − . × −
30 VI 1 . × . × . × . × . × − DRO 1 . × . × . × . × . × −
40 VI 1 . × . × . × . × . × DRO 1 . × . × . × . × . ×
50 VI 1 . × . × . × . × . × DRO 1 . × . × . × . × . ×

7.2.A. Consistent Noisy Measurements
Decision problem of the agent:
Assume that there are $n$ products with prices $s \in \mathbb{R}^n_+$. The agent aims to select a basket of goods $x \in \mathbb{R}^n_+$ that minimizes the true objective function $F(s,x) = \langle s, x\rangle - U(x)$, where $\langle s, x\rangle$ represents the purchasing costs, while the concave quadratic function $U(x) := -\langle x, Q^\star_{xx}x\rangle - \langle q^\star, x\rangle$ captures the utility of consumption. The positive definite matrix $Q^\star_{xx}$ is constructed as follows. Sample a square matrix $A$ uniformly from $[-\,\cdot\,,\,\cdot\,]^{n\times n}$, denote by $R$ the orthogonal matrix consisting of the orthonormal eigenvectors of $(A + A^\top)/2$, and set $Q^\star_{xx} := R^\top D R$, where $D$ is a diagonal matrix whose main diagonal is sampled uniformly from $[0.\!\cdot\,,\,\cdot\,]^n$. Moreover, the gradient $q^\star$ is sampled uniformly from $[-\,\cdot\,,\,\cdot\,]^n$. Finally, define the agent's feasible set as $\mathbb{X}(s) = [0, 5]^n$, which can be brought to the standard form of Assumption 5.1 by setting $W = (I, -I)^\top \in \mathbb{R}^{2n\times n}$, $H = (0, 0)^\top \in \mathbb{R}^{2n\times m}$, $h = 5\cdot(0, -\mathbf{1})^\top \in \mathbb{R}^{2n}$ and $\mathcal{K} = \mathbb{R}^{2n}_+$. As usual, for a fixed signal $s$, we denote the optimal value of the agent's true decision problem by $z^\star(s)$.
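A minimal sketch of this construction of the consumer's utility parameters; the sampling intervals are placeholders, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                             # number of products (example value)
A = rng.uniform(-1.0, 1.0, (n, n))                 # placeholder range
_, V = np.linalg.eigh((A + A.T) / 2)               # columns of V: orthonormal eigenvectors
R = V.T                                            # rows of R are the eigenvectors
D = np.diag(rng.uniform(0.1, 1.0, n))              # positive spectrum (placeholder range)
Q_xx_star = R.T @ D @ R                            # positive definite quadratic utility term
q_star = rng.uniform(-1.0, 1.0, n)                 # linear utility term (placeholder range)
```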
Generation of training samples: Signals follow the uniform distribution on the support set $\mathbb{S} = [0,\,\cdot\,]^n$, which can be brought to the standard form of Assumption 5.1 by setting $C = (I, -I)^\top \in \mathbb{R}^{2n\times n}$, $d = (0, -\,\cdot\,)^\top \in \mathbb{R}^{2n}$ and $\mathcal{C} = \mathbb{R}^{2n}_+$. As in Section 7.1.A, we assume that the agent's response to $s$ is noisy and constitutes a random $\delta$-suboptimal solution to (1) for $\delta = 0.2$.
Thus, $x$ is obtained as a solution of the auxiliary problem
$$\min_{x\in\mathbb{X}(s)} \big\{\langle x, Q_{\mathrm{rand}}\,x\rangle : \langle x, Q^\star_{xx}x\rangle + \langle s, x\rangle + \langle q^\star, x\rangle \le z^\star(s) + \delta\big\},$$
which minimizes a convex quadratic cost over the set of all $\delta$-suboptimal solutions to (1). Specifically, $Q_{\mathrm{rand}}$ is a diagonal matrix whose main diagonal is sampled uniformly from $[0,\,\cdot\,]^n$. As $(s,x) \in \Xi$ by construction, the measurement noise is consistent with the known support of the exact signal-response pairs. The imperfect consistent training samples $(\widehat{s}_i, \widehat{x}_i)$, $i \le N$, are now generated independently using the above procedure.
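The corresponding noisy purchases can again be generated with one auxiliary convex program per signal, now with a quadratic objective and a quadratic suboptimality constraint. A sketch continuing the previous snippet, assuming the box $\mathbb{X}(s) = [0,5]^n$ implied by the standard-form data above; all names are ours.

```python
import numpy as np
import cvxpy as cp

def consumer_response(s, delta=0.2, noisy=True):
    """Exact or delta-suboptimal response of the consumer with utility parameters above."""
    x = cp.Variable(n, nonneg=True)
    cost = cp.quad_form(x, Q_xx_star) + (s + q_star) @ x          # F(s, x) = <s, x> - U(x)
    exact = cp.Problem(cp.Minimize(cost), [x <= 5.0])
    exact.solve()
    if not noisy:
        return x.value
    z_star = exact.value
    Q_rand = np.diag(rng.uniform(0.0, 1.0, n))                    # placeholder range
    y = cp.Variable(n, nonneg=True)
    noisy_cost = cp.quad_form(y, Q_xx_star) + (s + q_star) @ y
    prob = cp.Problem(cp.Minimize(cp.quad_form(y, Q_rand)),
                      [y <= 5.0, noisy_cost <= z_star + delta])   # delta-suboptimal set
    prob.solve()
    return y.value
```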
Decision problem of the observer: The observer aims to identify the best quadratic hypothesis of the form $F_\theta(s,x) := \langle x, Q_{xx}x\rangle + \langle x, s\rangle + \langle q, x\rangle$, where the parameter $\theta = (Q_{xx}, q)$ ranges over the search space
$$\Theta = \big\{\theta = (Q_{xx}, q) \in \mathbb{R}^{n\times n}\times\mathbb{R}^n : Q_{xx} \succeq 0\big\}.$$
Note that no hypothesis $F_\theta(s,x)$, $\theta \in \Theta$, can vanish identically due to the term $\langle x, s\rangle$. Note also that the agent's true objective function corresponds to $\theta^\star = (Q^\star_{xx}, q^\star) \in \Theta$. We assume that the observer minimizes the suboptimality loss (4b), uses the expected value to measure risk and solves the distributionally robust inverse optimization problem (13) over a 2-Wasserstein ball around the empirical distribution on the training samples, where the 2-norm is used as the transportation cost on $\Xi$. By Theorem 6.3, the emerging inverse optimization problem is conservatively approximated by the tractable semidefinite program (24).
Figure 3. Impact of the Wasserstein radius on the out-of-sample suboptimality and predictability risk. (a) Imperfect consistent training samples and perfect test samples; (b) imperfect inconsistent training samples and perfect test samples; (c) model uncertainty.
Out-of-sample risk:
The quality of an estimator $\widehat{\theta}_N$ obtained from (24) is measured by its out-of-sample risk $\mathbb{E}^{\mathbb{P}_{\mathrm{out}}}(\ell_{\widehat{\theta}_N})$ both with respect to the predictability loss (4a) and the suboptimality loss (4b), where $\mathbb{P}_{\mathrm{out}}$ represents the distribution of a single test sample $(s,x)$ independent of the training samples, and where $x$ is an exact (non-noisy) response to $s$ in (1). More precisely, the out-of-sample risk is evaluated approximately using 1,000 independent test samples from $\mathbb{P}_{\mathrm{out}}$.

Results:
All numerical results are averaged across 100 independent problem instances $\{Q^\star_{xx}, q^\star\}$. Under the assumption that the signal and response dimensions are set to $n = 10$ and there are $N = 20$ training samples, Figure 3(a) shows how the out-of-sample risk of the optimal estimator $\widehat{\theta}_N(\varepsilon)$ obtained from (13) changes with the Wasserstein radius $\varepsilon$. As in Section 7.1.A, this experiment suggests that both the predictability and suboptimality risks can be reduced by using a distributionally robust inverse optimization model.

7.2.B. Inconsistent Noisy Measurements
Assume now that the agent's optimal solutions to (1) are corrupted by additive measurement noise that follows a uniform distribution on $[-\,\cdot\,,\,\cdot\,]^n$. Otherwise, we consider the exact same experimental setup as in Section 7.2.A. Under this premise, it is likely that some training responses are infeasible in (1), that is, $\widehat{x}_i \notin \mathbb{X}(\widehat{s}_i)$ and, a fortiori, $(\widehat{s}_i, \widehat{x}_i) \notin \Xi$ for some $i \le N$. These problematic training samples are inconsistent with the known support of the perfect signal-response pairs, and the corresponding empirical distribution $\widehat{\mathbb{P}}_N$ fails to be supported on $\Xi$. Thus, for all sufficiently small values of $\varepsilon$ there exists no distribution $\mathbb{Q}$ on $\Xi$ with $W(\mathbb{Q}, \widehat{\mathbb{P}}_N) \le \varepsilon$, implying that the Wasserstein ball $\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N)$ is empty, in which case the distributionally robust inverse optimization problem (24) becomes meaningless. Figure 3(b) visualizes the out-of-sample suboptimality and predictability risk of the optimal estimator $\widehat{\theta}(\varepsilon)$ as a function of $\varepsilon$. Note that for any fixed $\varepsilon$ the figure reports the out-of-sample risk averaged only across those problem instances $\{Q^\star_{xx}, q^\star\}$ for which $\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N) \neq \emptyset$. Inspecting the results at the instance level, we observe that the out-of-sample risk is typically minimized by the smallest value of $\varepsilon \ge 0$ with $\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N) \neq \emptyset$ (this is not evident from the aggregate results shown in Figure 3(b)). We conclude that a distributionally robust approach may be necessary for consistency.

7.2.C. Model Uncertainty
Assume next that the agent's objective function is not contained in the set of hypotheses $F_\theta(s,x)$, $\theta \in \Theta$, but that the signals and the agent's responses are unaffected by noise. Specifically, in analogy to [27], we assume that the true utility function is given by $U(x) := \langle\mathbf{1}, \sqrt{Ax - b}\rangle$, where $A$ is a diagonal matrix whose main diagonal is sampled uniformly from $[0.\!\cdot\,,\,\cdot\,]^n$, while $b$ is sampled uniformly from $[0,\,\cdot\,]^n$. The square root is applied componentwise and evaluates to $-\infty$ for negative arguments. Otherwise, we consider the exact same experimental setup as in Section 7.2.A. Figure 3(c) shows the out-of-sample suboptimality and predictability risk of the optimal estimator $\widehat{\theta}(\varepsilon)$ as a function of $\varepsilon$, indicating that the best results are obtained for strictly positive Wasserstein radii, which enable the observer to combat over-fitting to the training samples.

7.2.D. Comparison of Different Data-Driven Inverse Optimization Schemes
In practice, the Wasserstein radii offering the lowest out-of-sample risk must be estimated from the training samples only. To this end, the DRO approach calibrates $\varepsilon$ via $k$-fold cross validation as in Section 7.1. Another estimator is obtained by solving the empirical risk minimization (ERM) problem (8) using the empirical mean and the suboptimality loss (4b). This approach effectively mimics the DRO approach but sets $\varepsilon = 0$, which can be viewed as a trivial data-driven strategy to calibrate the Wasserstein radius. (We did not consider the ERM approach in Section 5 because it coincides with the VI approach for linear hypotheses.) The VI approach also disregards ambiguity and minimizes the empirical first-order loss by solving the semidefinite program
$$\begin{array}{cl}
\text{minimize} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N |r_i|\\
\text{subject to} & \langle W\widehat{x}_i - H\widehat{s}_i - h,\ \gamma_i\rangle \le r_i \quad \forall i\le N\\
& W^\top\gamma_i = 2Q_{xx}\widehat{x}_i + \widehat{s}_i + q \quad \forall i\le N\\
& \gamma_i \ge 0 \quad \forall i\le N\\
& Q_{xx} \succeq 0,
\end{array}$$
see [10, Theorem 3]. As in Section 7.1.A, the absolute values of the residuals $r_i$ in the objective can be dropped whenever the training samples are consistent with $\Xi$. We exclude the BP approach from this experiment because it leads to severely intractable mixed integer semidefinite programming problems.

All three approaches described above search over the parametric space of quadratic hypotheses. If the observer suspects that the true utility function fails to be quadratic, however, she may prefer a non-parametric approach that models the gradient of the utility function as a vector field $f \in \mathcal{H}^n$, where $\mathcal{H}$ represents a reproducing kernel Hilbert space of real-valued functions on $\mathbb{R}^n_+$, which is induced by a symmetric and positive definite kernel function $k : \mathbb{R}^n_+ \times \mathbb{R}^n_+ \to \mathbb{R}$. As $\mathcal{H}^n$ is typically infinite-dimensional and contains multiple candidate gradients $f \in \mathcal{H}^n$ that explain the training data, it has been suggested in [10, Section 5] to minimize the Hilbert norm $\sum_{i=1}^n \|f_i\|_{\mathcal{H}}^2$ of $f$ subject to the constraint that the residuals of the first-order optimality condition at the training samples satisfy $\frac{1}{N}\sum_{i=1}^N |r_i| \le \kappa$ for some prescribed threshold $\kappa \ge 0$, that is, to seek the smoothest (with respect to the kernel $k$) candidate gradient $f \in \mathcal{H}^n$ that explains the training data to within accuracy $\kappa$. By leveraging a generalized representer theorem, the resulting infinite-dimensional optimization problem can be reformulated as the tractable quadratic program
$$\begin{array}{cl}
\text{minimize} & \displaystyle\sum_{i=1}^n \langle e_i,\ \alpha K \alpha^\top e_i\rangle\\
\text{subject to} & \langle W\widehat{x}_i - H\widehat{s}_i - h,\ \gamma_i\rangle \le r_i \quad \forall i\le N\\
& W^\top\gamma_i = \widehat{s}_i - \alpha K e_i \quad \forall i\le N\\
& \gamma_i \ge 0 \quad \forall i\le N\\
& \dfrac{1}{N}\displaystyle\sum_{i=1}^N |r_i| \le \kappa,
\end{array} \tag{33a}$$
where $K \in \mathbb{R}^{N\times N}$ denotes the kernel matrix with entries $K_{ij} = k(\widehat{x}_i, \widehat{x}_j)$, $e_i$ stands for the $i$-th standard basis vector in a space of appropriate dimension, and $\alpha \in \mathbb{R}^{n\times N}$ denotes a decision variable that encodes the partial derivatives of the utility function via $f_i(x) = \sum_{j=1}^N \alpha_{ij} k(\widehat{x}_j, x)$; see [10, Theorem 5].

In the inverse optimization context studied here, however, the above non-parametric VI approach suffers from two shortcomings that are not addressed in [10].

(i) The inverse optimization problem (33a) aims to learn the gradient field $f$ of the unknown utility function $U$. As the Hessian matrix of $U$ must be symmetric, the gradient field $f$ must satisfy
$$\frac{\partial f_i(x)}{\partial x_j} = \frac{\partial f_j(x)}{\partial x_i} \quad \forall x \in \mathbb{R}^n_+,\ \forall i, j = 1, \dots, n.$$
By Stokes' theorem, this condition is necessary and sufficient to ensure that the utility function $U$ can be reconstructed uniquely (modulo an additive constant) from $f$. Specifically, this condition guarantees that the utility function is defined unambiguously through the line integral $U(x) = U(0) + \int_C \langle f(x'), \mathrm{d}x'\rangle$, where $C$ represents an arbitrary piecewise smooth curve in $\mathbb{R}^n_+$ connecting $0$ and $x$.

(ii) While the inverse optimization problem (33a) represents a tractable quadratic program, the resulting gradient field $f$ may induce a non-concave utility function $U$, which means that the agent's objective function $F(s,x) = \langle s, x\rangle - U(x)$ may have multiple local minima. Thus, even though learning $f$ is easy, predicting the optimal response $x$ to a given signal $s$ may require the solution of a hard non-convex optimization problem, which may severely complicate extensive out-of-sample tests. Similarly, evaluating the suboptimality loss of a fixed signal-response pair is intractable.

Here we address the first challenge by using the polynomial kernel function $k(x, x') = (c\langle x, x'\rangle + 1)^p$, which allows us to include the missing symmetry conditions in (33a) by appending the linear equality constraints
$$\sum_{k=1}^N \Big(\alpha_{ik}[\widehat{x}_k]_j - \alpha_{jk}[\widehat{x}_k]_i\Big) \prod_{t=1}^{q-1} [\widehat{x}_k]_{l_t} = 0 \quad \forall i, j = 1, \dots, n,\ \forall 1 \le l_1 \le \dots \le l_{q-1} \le n,\ \forall q = 1, \dots, p. \tag{33b}$$
The second challenge does not have a simple remedy, and therefore we abandon the ideal goal to solve (1) to global optimality. Instead, for any given signal $s$, we use the gradient field $f$ obtained from (33) directly to find a local solution of (1) via the classical subgradient descent algorithm [16, Chapter 3], where the step size is set to $10^{-\,\cdot}$ and an initial feasible solution is selected uniformly at random from $\mathbb{X}(s)$. Note that the focus on local minima in prediction is consistent with the use of the first-order loss (4c) in inverse optimization because the first-order loss cannot distinguish local and global optima and thus fails to penalize training samples $(\widehat{s}_i, \widehat{x}_i)$ where $\widehat{x}_i$ is a locally optimal but globally suboptimal response to $\widehat{s}_i$. In our experiments we determine the parameter $c$ of the polynomial kernel via 5-fold cross validation.
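A sketch of the non-parametric estimation problem (33a) with a polynomial kernel; the symmetry constraints (33b) are omitted here for brevity, and all names are ours.

```python
import cvxpy as cp
import numpy as np

def nonparametric_vi(S_hat, X_hat, W, H, h, kappa, c=1.0, p=2):
    """Kernelized VI estimator: minimize the RKHS norm of the gradient field f
    subject to first-order residuals bounded by kappa, cf. (33a)."""
    N, n = X_hat.shape
    k = W.shape[0]
    K = (c * (X_hat @ X_hat.T) + 1.0) ** p                         # polynomial kernel matrix
    alpha = cp.Variable((n, N))                                    # coefficients of f_i(.)
    r = cp.Variable(N)
    gammas = [cp.Variable(k, nonneg=True) for _ in range(N)]
    cons = [cp.sum(cp.abs(r)) / N <= kappa]
    for i in range(N):
        resid = W @ X_hat[i] - H @ S_hat[i] - h
        cons += [resid @ gammas[i] <= r[i],
                 W.T @ gammas[i] == S_hat[i] - alpha @ K[:, i]]    # f(x_hat_i) = alpha K e_i
    objective = cp.Minimize(sum(cp.quad_form(alpha[i], K) for i in range(n)))
    prob = cp.Problem(objective, cons)
    prob.solve()
    return alpha.value, K
```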
Once the gradient field $f$ has been inferred from (33), we construct the corresponding utility function by integrating $f$ along the straight line $C$ between $0$ and $x$ with parameterization $g(t) = tx$ for $t \in [0, 1]$:
$$\begin{aligned}
U(x) &= U(0) + \int_C \langle f(x'), \mathrm{d}x'\rangle = U(0) + \int_0^1 \langle f(tx), x\rangle\,\mathrm{d}t = U(0) + \int_0^1 \sum_{i=1}^n\sum_{j=1}^N x_i\,\alpha_{ij}\,k(\widehat{x}_j, tx)\,\mathrm{d}t\\
&= U(0) + \sum_{i=1}^n\sum_{j=1}^N x_i\,\alpha_{ij} \int_0^1 \big(c\,t\,\langle x, \widehat{x}_j\rangle + 1\big)^p\,\mathrm{d}t
= U(0) + \sum_{i=1}^n\sum_{j=1}^N x_i\,\alpha_{ij}\,\frac{\big(c\langle x, \widehat{x}_j\rangle + 1\big)^{p+1} - 1}{c\,(p+1)\,\langle x, \widehat{x}_j\rangle}.
\end{aligned}$$
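The closed-form line integral above is straightforward to evaluate numerically. A sketch continuing the previous notation; $U(0)$ is an irrelevant additive constant and is set to zero, and all names are ours.

```python
import numpy as np

def utility_from_gradient(x, alpha, X_hat, c=1.0, p=2, eps=1e-12):
    """Evaluate U(x) - U(0) by integrating the kernel gradient field along the segment 0 -> x."""
    N = X_hat.shape[0]
    u = 0.0
    for j in range(N):
        ip = float(x @ X_hat[j])                                   # <x, x_hat_j>
        if abs(ip) < eps:
            integral = 1.0                                         # limit of the formula as <x, x_hat_j> -> 0
        else:
            integral = ((c * ip + 1.0) ** (p + 1) - 1.0) / (c * (p + 1) * ip)
        u += float(x @ alpha[:, j]) * integral                     # sum_i x_i alpha_ij * integral
    return u
```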
Table 5 reports the out-of-sample suboptimality and predictability risks, respectively, for the DRO, VI and ERM estimators. The non-parametric VI approach is exclusively used in the presence of model uncertainty, which is the only scenario in which it has a chance to outperform the more parsimonious parametric approaches. Our results show that the DRO estimator consistently attains the lowest suboptimality and predictability risk among all parametric approaches. We emphasize that the suboptimality and predictability risk of the non-parametric VI approach are evaluated with respect to the local minimum identified by the subgradient descent algorithm and are therefore largely meaningless. In fact, we observed that the suboptimality risk becomes even negative for certain instances in which the global minimum of (1) could not be found. This artefact explains the low suboptimality risk of the non-parametric VI approach with $p = 3$. Note also that for $p \ge$

Table 5. Data-driven inverse optimization: out-of-sample suboptimality and predictability risk of the VI, ERM and DRO approaches in different experimental settings.

Setting                          Method                        Suboptimality    Predictability
Consistent noisy measurements    VI (parametric)               3.·×10^−·        ·×10^·
                                 ERM                           8.·×10^−·        ·×10^−·
                                 DRO                           7.·×10^−·        ·×10^−·
Inconsistent noisy measurements  VI (parametric)               9.·×10^−·        ·×10^−·
                                 ERM                           unbounded        unbounded
                                 DRO                           4.·×10^−·        ·×10^−·
Model uncertainty                VI (parametric)               8.·×10^−·        ·×10^·
                                 VI (non-parametric, p = 2)    1.·×10^·         ·×10^·
                                 VI (non-parametric, p = 3)    4.·×10^−·        ·×10^·
                                 ERM                           6.·×10^−·        ·×10^·
                                 DRO                           6.·×10^−·        ·×10^·

Acknowledgments:
This work was supported by the Swiss National Science Foundation grant BSCGI0 157733.
Appendix A. Convex Hypotheses
Consider the class $\mathcal{F}$ of hypotheses of the form $F_\theta(s,x) := \langle\theta, \Psi(x)\rangle$, where each component function of the feature map $\Psi : \mathbb{R}^n \to \mathbb{R}^d$ is convex, and where the weight vector $\theta$ ranges over a convex closed search space $\Theta \subseteq \mathbb{R}^d_+$. Thus, by construction, $F_\theta(s,x)$ is convex in $x$ for every fixed $\theta \in \Theta$. In the remainder we will assume without much loss of generality that the transformation from the signal-response pair $(s,x)$ to the signal-feature pair $(s, \Psi(x))$ is Lipschitz continuous with Lipschitz modulus 1, that is, we require
$$\|(s_1, \Psi(x_1)) - (s_2, \Psi(x_2))\| \le \|(s_1, x_1) - (s_2, x_2)\| \quad \forall (s_1, x_1), (s_2, x_2) \in \mathbb{R}^m \times \mathbb{R}^n. \tag{34}$$
Before devising a safe convex approximation for the inverse optimization problem (13) with convex hypotheses, we recall that the conjugate of a function $f : \mathbb{R}^n \to \mathbb{R}$ is defined through $f^*(z) = \sup_{x\in\mathbb{R}^n} \langle z, x\rangle - f(x)$.

Theorem A.1 (Convex hypotheses and suboptimality loss). Assume that $\mathcal{F}$ represents the class of convex hypotheses induced by the feature map $\Psi$ and with a convex closed search space $\Theta \subseteq \mathbb{R}^d_+$, and that Assumption 5.1 holds. If the observer uses the suboptimality loss (4b) and measures risk using the CVaR at level $\alpha \in (0, 1]$, then the following convex program provides a safe approximation for the distributionally robust inverse optimization problem (13) over the 1-Wasserstein ball:
$$\begin{array}{cl}
\text{minimize} & \tau + \dfrac{1}{\alpha}\left(\varepsilon\lambda + \dfrac{1}{N}\displaystyle\sum_{i=1}^N r_i\right)\\[1ex]
\text{subject to} & \theta \in \Theta,\ \lambda\in\mathbb{R}_+,\ \tau, r_i\in\mathbb{R},\ \phi_i, \underline{\phi}_i\in\mathcal{C}^*,\ \gamma_i\in\mathcal{K}^*,\ z_{ij}\in\mathbb{R}^n \quad \forall i\le N,\ \forall j\le d\\
& \displaystyle\sum_{j=1}^d \theta_j\Psi^*_j(z_{ij}/\theta_j) + \langle\theta, \Psi(\widehat{x}_i)\rangle + \langle\phi_i, C\widehat{s}_i - d\rangle - \langle\gamma_i, H\widehat{s}_i + h\rangle \le r_i + \tau \quad \forall i\le N\\
& \displaystyle\sum_{j=1}^d z_{ij} = W^\top\gamma_i \quad \forall i\le N\\
& \langle C\widehat{s}_i - d,\ \underline{\phi}_i\rangle \le r_i \quad \forall i\le N\\
& \left\|\begin{pmatrix} H^\top\gamma_i - C^\top\phi_i\\ \theta\end{pmatrix}\right\|_* \le \lambda, \quad \left\|\begin{pmatrix} C^\top\underline{\phi}_i\\ 0\end{pmatrix}\right\|_* \le \lambda \quad \forall i\le N.
\end{array} \tag{35}$$
Note that Theorem A.1 remains valid if $(\widehat{s}_i, \widehat{x}_i) \notin \Xi$ for some $i \le N$.

Proof of Theorem A.1.
As in the proof of Theorem 5.2 one can show that the objective function of the inverse optimization problem (13) is equivalent to (19). In the remainder, we derive a safe convex approximation for the (intractable) subordinate worst-case expectation problem
$$\sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}^{\mathbb{Q}}\big[\max\{\ell_\theta(s,x) - \tau,\ 0\}\big]. \tag{36}$$
To this end, note that the suboptimality loss $\ell_\theta(s,x) = \langle\theta, \Psi(x)\rangle - \min_{y\in\mathbb{X}(s)}\langle\theta, \Psi(y)\rangle$ depends on $x$ only through $\Psi(x)$. This motivates us to define a lifted suboptimality loss $\ell^\Psi_\theta(s,\psi) = \langle\theta, \psi\rangle - \min_{y\in\mathbb{X}(s)}\langle\theta, \Psi(y)\rangle$, where $\psi$ represents an element of the feature space $\mathbb{R}^d$, and the empirical distribution $\widehat{\mathbb{P}}^\Psi_N = \frac{1}{N}\sum_{i=1}^N \delta_{(\widehat{s}_i, \Psi(\widehat{x}_i))}$ of the signal-feature pairs. Moreover, we denote by $\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}^\Psi_N)$ the 1-Wasserstein ball of all distributions on $\mathbb{S}\times\mathbb{R}^d$ that have a distance of at most $\varepsilon$ from $\widehat{\mathbb{P}}^\Psi_N$. In the following we will show that the worst-case expectation
$$\sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}^\Psi_N)} \mathbb{E}^{\mathbb{Q}}\big[\max\{\ell^\Psi_\theta(s,\psi) - \tau,\ 0\}\big] \tag{37}$$
on the signal-feature space $\mathbb{S}\times\mathbb{R}^d$ provides a tractable upper bound on the worst-case expectation (36).

By Definition 4.3, each distribution $\mathbb{Q}\in\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N)$ corresponds to a transportation plan $\Pi$, that is, a joint distribution of two signal-response pairs $(s,x)$ and $(s',x')$ under which $(s,x)$ has marginal distribution $\mathbb{Q}$ and $(s',x')$ has marginal distribution $\widehat{\mathbb{P}}_N$. By the law of total probability, the transportation plan can be expressed as $\Pi = \frac{1}{N}\sum_{i=1}^N \delta_{(\widehat{s}_i,\widehat{x}_i)}\otimes\mathbb{Q}_i$, where $\mathbb{Q}_i$ denotes the conditional distribution of $(s,x)$ given $(s',x') = (\widehat{s}_i,\widehat{x}_i)$, $i \le N$; see also [30, Theorem 4.2]. Thus, the worst-case expectation (36) satisfies
$$\begin{aligned}
&\sup_{\mathbb{Q}\in\mathbb{B}_\varepsilon(\widehat{\mathbb{P}}_N)} \mathbb{E}^{\mathbb{Q}}\big[\max\{\ell_\theta(s,x)-\tau,\ 0\}\big]\\
&\quad=\left\{\begin{array}{cl}
\sup\limits_{\mathbb{Q}_i} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N\int_\Xi \max\{\ell^\Psi_\theta(s,\Psi(x))-\tau,\ 0\}\,\mathbb{Q}_i(\mathrm{d}s,\mathrm{d}x)\\
\text{s.t.} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N\int_\Xi \|(s,x)-(\widehat{s}_i,\widehat{x}_i)\|\,\mathbb{Q}_i(\mathrm{d}s,\mathrm{d}x) \le \varepsilon, \quad \displaystyle\int_\Xi \mathbb{Q}_i(\mathrm{d}s,\mathrm{d}x)=1\ \ \forall i\le N
\end{array}\right.\\
&\quad\le\left\{\begin{array}{cl}
\sup\limits_{\mathbb{Q}_i} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N\int_\Xi \max\{\ell^\Psi_\theta(s,\Psi(x))-\tau,\ 0\}\,\mathbb{Q}_i(\mathrm{d}s,\mathrm{d}x)\\
\text{s.t.} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N\int_\Xi \|(s,\Psi(x))-(\widehat{s}_i,\Psi(\widehat{x}_i))\|\,\mathbb{Q}_i(\mathrm{d}s,\mathrm{d}x) \le \varepsilon, \quad \displaystyle\int_\Xi \mathbb{Q}_i(\mathrm{d}s,\mathrm{d}x)=1\ \ \forall i\le N
\end{array}\right.\\
&\quad\le\left\{\begin{array}{cl}
\sup\limits_{\mathbb{Q}_i} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N\int_{\mathbb{S}\times\mathbb{R}^d} \max\{\ell^\Psi_\theta(s,\psi)-\tau,\ 0\}\,\mathbb{Q}_i(\mathrm{d}s,\mathrm{d}\psi)\\
\text{s.t.} & \dfrac{1}{N}\displaystyle\sum_{i=1}^N\int_{\mathbb{S}\times\mathbb{R}^d} \|(s,\psi)-(\widehat{s}_i,\Psi(\widehat{x}_i))\|\,\mathbb{Q}_i(\mathrm{d}s,\mathrm{d}\psi) \le \varepsilon, \quad \displaystyle\int_{\mathbb{S}\times\mathbb{R}^d} \mathbb{Q}_i(\mathrm{d}s,\mathrm{d}\psi)=1\ \ \forall i\le N,
\end{array}\right.
\end{aligned}$$
where the first inequality follows from (34), while the second inequality follows from relaxing the implicit condition that the signal-feature pair $(s,\psi)$ must be supported on $\{(s,\Psi(x)) : (s,x)\in\Xi\} \subseteq \mathbb{S}\times\mathbb{R}^d$. Using a similar reasoning as before, the last expression is readily recognized as the worst-case expectation (37). Thus, (37) provides indeed an upper bound on (36). Duality arguments borrowed from [30, Theorem 4.2] imply that the infinite-dimensional linear program (37) admits a strong dual of the form
$$\begin{array}{cl}
\inf\limits_{\lambda\ge 0,\, r_i} & \lambda\varepsilon + \dfrac{1}{N}\displaystyle\sum_{i=1}^N r_i\\[1ex]
\text{s.t.} & \sup\limits_{(s,y)\in\Xi,\ \psi\in\mathbb{R}^d} \langle\theta, \psi - \Psi(y)\rangle - \tau - \lambda\|(s,\psi) - (\widehat{s}_i, \Psi(\widehat{x}_i))\| \le r_i \quad \forall i\le N\\[1ex]
& \sup\limits_{s\in\mathbb{S},\ \psi\in\mathbb{R}^d} -\lambda\|(s,\psi) - (\widehat{s}_i, \Psi(\widehat{x}_i))\| \le r_i \quad \forall i\le N.
\end{array} \tag{38}$$
By using the definitions of $\mathbb{S}$ and $\mathbb{X}(s)$ put forth in Assumption 5.1, the $i$-th member of the first constraint group in (38) is satisfied if and only if the optimal value of the maximization problem
$$\begin{array}{cl}
\sup\limits_{s,y,\psi} & \langle\theta, \psi - \Psi(y)\rangle - \tau - \lambda\|(s,\psi) - (\widehat{s}_i, \Psi(\widehat{x}_i))\|\\
\text{s.t.} & Cs \succeq_{\mathcal{C}} d, \quad Wy \succeq_{\mathcal{K}} Hs + h
\end{array} \tag{39}$$
does not exceed $r_i$. A tedious but routine calculation shows that the dual of (39) can be represented as
$$\begin{array}{cl}
\inf\limits_{\gamma_i, \phi_i, z_{ij}} & \displaystyle\sum_{j=1}^d \theta_j\Psi^*_j(z_{ij}/\theta_j) - \tau + \langle\theta, \Psi(\widehat{x}_i)\rangle + \langle\phi_i, C\widehat{s}_i - d\rangle - \langle\gamma_i, H\widehat{s}_i + h\rangle\\
\text{s.t.} & \displaystyle\sum_{j=1}^d z_{ij} = W^\top\gamma_i, \quad \left\|\begin{pmatrix} H^\top\gamma_i - C^\top\phi_i\\ \theta\end{pmatrix}\right\|_* \le \lambda, \quad \gamma_i\in\mathcal{K}^*,\ \phi_i\in\mathcal{C}^*.
\end{array} \tag{40}$$
Note that the perspective functions $\theta_j\Psi^*_j(z_{ij}/\theta_j)$ in the objective of (40) emerge from taking the conjugate of $\theta_j\Psi_j(y)$. Thus, for $\theta_j = 0$ we must interpret $\theta_j\Psi^*_j(z_{ij}/\theta_j)$ as an indicator function in $z_{ij}$ which vanishes for $z_{ij} = 0$ and equals $\infty$ otherwise. By weak duality, (40) provides an upper bound on (39). We conclude that the $i$-th member of the first constraint group in (38) is satisfied whenever the dual problem (40) has a feasible solution whose objective value does not exceed $r_i$. A similar reasoning shows that the $i$-th member of the second constraint group in (38) holds if and only if there exists $\underline{\phi}_i\in\mathcal{C}^*$ such that $\langle C\widehat{s}_i - d,\ \underline{\phi}_i\rangle \le r_i$ and
$$\left\|\begin{pmatrix} C^\top\underline{\phi}_i\\ 0\end{pmatrix}\right\|_* \le \lambda.$$
Thus, the worst-case expectation (36) is bounded above by the optimal value of the finite convex program
$$\begin{array}{cl}
\inf & \varepsilon\lambda + \dfrac{1}{N}\displaystyle\sum_{i=1}^N r_i\\[1ex]
\text{s.t.} & \lambda\in\mathbb{R}_+,\ r_i\in\mathbb{R},\ \phi_i, \underline{\phi}_i\in\mathcal{C}^*,\ \gamma_i\in\mathcal{K}^*,\ z_{ij}\in\mathbb{R}^n \quad \forall i\le N,\ \forall j\le d\\
& \displaystyle\sum_{j=1}^d \theta_j\Psi^*_j(z_{ij}/\theta_j) + \langle\theta, \Psi(\widehat{x}_i)\rangle + \langle\phi_i, C\widehat{s}_i - d\rangle - \langle\gamma_i, H\widehat{s}_i + h\rangle \le r_i + \tau \quad \forall i\le N\\
& \displaystyle\sum_{j=1}^d z_{ij} = W^\top\gamma_i \quad \forall i\le N\\
& \langle C\widehat{s}_i - d,\ \underline{\phi}_i\rangle \le r_i \quad \forall i\le N\\
& \left\|\begin{pmatrix} H^\top\gamma_i - C^\top\phi_i\\ \theta\end{pmatrix}\right\|_* \le \lambda, \quad \left\|\begin{pmatrix} C^\top\underline{\phi}_i\\ 0\end{pmatrix}\right\|_* \le \lambda \quad \forall i\le N.
\end{array}$$
The claim then follows by substituting this convex program into (19). $\square$
References

[1] D. Ackerberg, C. Lanier Benkard, S. Berry, and A. Pakes, Econometric tools for analyzing market outcomes, in Handbook of Econometrics, J. Heckman and E. Leamer, eds., Elsevier, 2007, pp. 4171–4276.
[2] S. Ahmed and Y. Guan, The inverse optimal value problem, Mathematical Programming A, 102 (2005), pp. 91–110.
[3] R. K. Ahuja and J. B. Orlin, A faster algorithm for the inverse spanning tree problem, Journal of Algorithms, 34 (2000), pp. 177–193.
[4] R. K. Ahuja and J. B. Orlin, Inverse optimization, Operations Research, 49 (2001), pp. 771–783.
[5] B. D. Anderson and J. B. Moore, Optimal Control: Linear Quadratic Methods, Courier Corporation, 2007.
[6] A. Aswani, Z.-J. M. Shen, and A. Siddiq, Inverse optimization with noisy data, arXiv preprint arXiv:1507.03266, (2015).
[7] P. Bajari, C. L. Benkard, and J. Levin, Estimating dynamic models of imperfect competition, tech. report, National Bureau of Economic Research, 2004.
[8] O. Ben-Ayed and C. E. Blair, Computational difficulties of bilevel linear programming, Operations Research, 38 (1990), pp. 556–560.
[9] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, Princeton University Press, 2009.
[10] D. Bertsimas, V. Gupta, and I. C. Paschalidis, Inverse optimization: A new perspective on the Black-Litterman model, Operations Research, 60 (2012), pp. 1389–1403.
[11] D. Bertsimas, V. Gupta, and I. C. Paschalidis, Data-driven estimation in equilibrium using inverse optimization, Mathematical Programming, 153 (2015), pp. 595–633.
[12] E. Boissard, Simple bounds for convergence of empirical and occupation measures in 1-Wasserstein distance, Electronic Journal of Probability, 16 (2011), pp. 2296–2333.
[13] C. L. Bottasso, B. I. Prilutsky, A. Croce, E. Imberti, and S. Sartirana, A numerical procedure for inferring from experimental data the optimization cost functions using a multibody model of the neuro-musculoskeletal system, Multibody System Dynamics, 16 (2006), pp. 123–154.
[14] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2009.
[15] D. Braess, A. Nagurney, and T. Wakolbinger, On a paradox of traffic planning, Transportation Science, 39 (2005), pp. 446–450.
[16] S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends in Machine Learning, 8 (2015), pp. 231–357.
[17] D. Burton and P. L. Toint, On an instance of the inverse shortest paths problem, Mathematical Programming, 53 (1992), pp. 45–61.
[18] S. Carr and W. Lovejoy, The inverse newsvendor problem: Choosing an optimal demand portfolio for capacitated resources, Management Science, 46 (2000), pp. 912–927.
[19] T. C. Chan, T. Craig, T. Lee, and M. B. Sharpe, Generalized inverse multiobjective optimization with application to cancer therapy, Operations Research, 62 (2014), pp. 680–695.
[20] A. Faragó, Á. Szentesi, and B. Szviatovszki, Inverse optimization in high-speed networks, Discrete Applied Mathematics, 129 (2003), pp. 83–98.
[21] N. Fournier and A. Guillin, On the rate of convergence in Wasserstein distance of the empirical measure, Probability Theory and Related Fields, 162 (2014), pp. 707–738.
[22] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning, Springer, 2001.
[23] P. Harker and J. Pang, Finite-dimensional variational inequality and nonlinear complementarity problems: A survey of theory, algorithms and applications, Mathematical Programming, 48 (1990), pp. 161–220.
[24] C. Heuberger, Inverse combinatorial optimization: A survey on problems, methods, and results, Journal of Combinatorial Optimization, 8 (2004), pp. 329–361.
[25] D. S. Hochbaum, Efficient algorithms for the inverse spanning-tree problem, Operations Research, 51 (2003), pp. 785–797.
[26] G. Iyengar and W. Kang, Inverse conic programming with applications, Operations Research Letters, 33 (2005), pp. 319–330.
[27] A. Keshavarz, Y. Wang, and S. Boyd, Imputing a convex objective function, in IEEE International Symposium on Intelligent Control, 2011, pp. 613–619.
[28] J. Löfberg, YALMIP: A toolbox for modeling and optimization in MATLAB, in IEEE International Symposium on Computer Aided Control Systems Design, 2004, pp. 284–289.
[29] H. M. Markowitz, Portfolio Selection: Efficient Diversification of Investments, Yale University Press, 1968.
[30] P. Mohajerin Esfahani and D. Kuhn, Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations, Mathematical Programming, (2017), pp. 1–52.
[31] K. G. Murty and S. N. Kabadi, Some NP-complete problems in quadratic and nonlinear programming, Mathematical Programming, 39 (1987), pp. 117–129.
[32] G. Neumann-Denzau and J. Behrens, Inversion of seismic data using tomographical reconstruction techniques for investigations of laterally inhomogeneous media, Geophysical Journal International, 79 (1984), pp. 305–315.
[33] R. T. Rockafellar and S. Uryasev, Optimization of conditional Value-at-Risk, Journal of Risk, 2 (2000), pp. 21–42.
[34] J. Saez-Gallego, J. M. Morales, M. Zugno, and H. Madsen, A data-driven bidding model for a cluster of price-responsive consumers of electricity, IEEE Transactions on Power Systems, 31 (2016), pp. 5001–5011.
[35] A. J. Schaefer, Inverse integer programming, Optimization Letters, 3 (2009), pp. 483–489.
[36] S. Shafieezadeh-Abadeh, P. Mohajerin Esfahani, and D. Kuhn, Distributionally robust logistic regression, in Advances in Neural Information Processing Systems, 2015, pp. 1576–1584.
[37] M. Sion, On general minimax theorems, Pacific Journal of Mathematics, 8 (1958), pp. 171–176.
[38] A. V. Terekhov, Y. B. Pesin, X. Niu, M. L. Latash, and V. M. Zatsiorsky, An analytical approach to the problem of inverse optimization with additive objective functions: An application to human prehension, Journal of Mathematical Biology, 61 (2010), pp. 423–453.
[39] M. D. Troutt, W.-K. Pang, and S.-H. Hou, Behavioral estimation of mathematical programming objective function coefficients, Management Science, 52 (2006), pp. 422–434.
[40] L. Wang, Cutting plane algorithms for the inverse mixed integer linear programming problem, Operations Research Letters, 37 (2009), pp. 114–116.
[41] J. H. Woodhouse and A. M. Dziewonski, Mapping the upper mantle: Three-dimensional modeling of earth structure by inversion of seismic waveforms, Journal of Geophysical Research, 89 (1984), pp. 5953–5986.