Optimal Transport in the Face of Noisy Data
Bart P.G. Van Parys ∗ Sloan School of Management, MIT
February 9, 2021
Abstract
Optimal transport distances are popular and theoretically well understood in the context of data-driven prediction. A flurry of recent work has popularized these distances for data-driven decision-making as well, although their merits in this context are far less well understood. This is in contrast to the more classical entropic distances, which are known to enjoy optimal statistical properties. This begs the question when, if ever, optimal transport distances enjoy similar statistical guarantees. Optimal transport methods are shown here to enjoy optimal statistical guarantees for decision problems faced with noisy data.
1 Introduction

Let P be a family of probability measures over a space Ξ and let P be an unknown probability measure in this family. Many problems in the machine learning community attempt at heart to learn the unknown data generating probability measure from a finite collection of independent data observations

ξ_i ∼ P  ∀ i ∈ [1, ..., N].   (1)

The most obvious application of this class of learning problems is perhaps density estimation [24]. Depending on whether the family P is the entire probability simplex P(Ξ) over Ξ or merely a parametrized subset, such methods can be classified as nonparametric or parametric, respectively. That is not to say that all machine learning problems with the data observation model (1) are density estimation problems in disguise. Often, one is not interested in the probability measure P per se but rather only in a certain aspect of it. In predictive problems, in which the data consists of both dependent and independent variables, typically only the conditional expectation of the dependent variable given the independent variables is of interest. In prescriptive problems, which are here the problems of primary interest, the problem of immediate concern is to find an ε-suboptimal solution, with ε > 0, to a stochastic optimization problem, i.e.,

z(P) ∈ arg_ε inf_{z ∈ Z} { E_P[ℓ(z, ξ)] = ∫ ℓ(z, ξ) dP(ξ) }.   (2)

A wide spectrum of decision problems can be cast as instances of (2). Shapiro et al. [23] point out that (2) can be viewed as the first stage of a two-stage stochastic program, where the loss function ℓ : Z × Ξ → R embodies the optimal value of a subordinate second-stage problem. Alternatively, problem (2) may also be interpreted as a generic learning problem in the spirit of statistical learning theory. Rather than learning the unknown probability measure P, the primary objective in data-driven decision-making is to learn the cost function, and perhaps even better an ε-suboptimal decision to (2), directly from data.

∗ [email protected]
Here arg_ε inf_{z ∈ Z} E_P[ℓ(z, ξ)] with ε > 0 denotes the set { z ∈ Z : E_P[ℓ(z, ξ)] < inf_{z ∈ Z} E_P[ℓ(z, ξ)] + ε }.
In this paper we will refer to the observational model described in Equation (1) as "noiseless". That is, the learner has access to uncorrupted independent samples from the probability measure of interest P. Clearly, given any finite amount of data the learner can not expect to learn the data generating probability measure exactly even in this noiseless regime. In case the probability measure P is only known to belong to the probability simplex P(Ξ), one reasonable substitute for P could be its empirically observed counterpart denoted here as P_N := Σ_{i=1}^N δ_{ξ_i}/N. If on the other hand some prior information is available in the sense that the probability measure P is known to belong to a subset P ⊂ P(Ξ), a maximum likelihood estimate [13] is often used instead.

In the machine learning and robust optimization communities such point estimates are widely known to be problematic when used naively in subsequent analysis. In particular, it is widely established both empirically as well as in theory that a sample average formulation

z(P_N) ∈ arg_ε inf_{z ∈ Z} E_{P_N}[ℓ(z, ξ)],   (3)

which substitutes P with a mere point estimate P_N, tends to disappoint (E_P[ℓ(z(P_N), ξ)] > E_{P_N}[ℓ(z(P_N), ξ)]) out of sample. That is, the actual cost observed out of sample exceeds the cost predicted for the data-driven decision z(P_N). This adversarial phenomenon is well known colloquially as the "Optimizer's Curse" [16] and is akin to the overfitting phenomenon in the context of prediction problems. Such adversarial phenomena, related to over-calibration to observed data but poor performance on out-of-sample data, can be attributed primarily to the treatment of mere point estimates as exact substitutes for the unknown probability measure.

Ambiguity sets consisting of all probability measures sufficiently compatible with the observed data can offer a better alternative to simple point estimates. As the data observations are here independent and identically distributed, their order is irrelevant, and ambiguity sets A_N(P_N) ⊆ P can be made functions of the empirical probability measure P_N rather than the data itself. A large line of work in the robust optimization community, see [20] and references therein, focuses consequently on data-driven formulations of the form

z_A(P_N) ∈ arg_ε inf_{z ∈ Z} sup { E_P[ℓ(z, ξ)] : P ∈ A_N(P_N) },

which can be thought of as robust counterparts to the nominal sample average formulation stated in Equation (3). Robust formulations guard against over-calibrated decisions by forcing any decision to do well on all distributions sufficiently compatible with the observed data as opposed to only a single point estimate. The recent uptick in popularity of robust formulations is in no small part due to the fact that they are often just as tractable as, and typically enjoy superior statistical properties compared to, their nominal counterparts. Much of the early literature [6, 28, 26] focused on ambiguity sets consisting of probability measures sharing certain given moments. More recent approaches [4] however consider ambiguity sets

A_N(P_N) = { P ∈ P : D(P_N, P) ≤ r_N }

which are based on a statistical distance D : P(Ξ) × P(Ξ) → R ∪ {+∞} instead. The latter ambiguity sets can hence be interpreted as the set of probability measures sufficiently close to the empirical probability measure P_N.
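To make the optimizer's curse concrete, the following small simulation solves the sample average formulation (3) repeatedly and records how often the realized cost exceeds the budgeted cost. The newsvendor loss, the exponential demand law, the helper names and all numerical values are ours and purely illustrative; they are not part of the formal development.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative newsvendor instance: order z units at unit cost 1, sell min(z, xi) at price 2.
def loss(z, xi):
    return z - 2.0 * np.minimum(z, xi)

def sample_P(n):                       # the "true" P, unknown to the decision maker
    return rng.exponential(scale=10.0, size=n)

N, grid, trials = 20, np.linspace(0.0, 60.0, 301), 200
disappointed = 0
for _ in range(trials):
    data = sample_P(N)                                   # noiseless observations as in (1)
    in_sample = np.array([loss(z, data).mean() for z in grid])
    z_N = grid[in_sample.argmin()]                       # sample average decision from (3)
    predicted = in_sample.min()                          # E_{P_N}[loss(z(P_N), xi)]
    actual = loss(z_N, sample_P(50_000)).mean()          # Monte Carlo E_P[loss(z(P_N), xi)]
    disappointed += actual > predicted
print(f"fraction of runs where the decision disappoints out of sample: {disappointed/trials:.2f}")
```

Such runs tend to report disappointment in a clear majority of the trials, which is precisely the behaviour the robust formulations above are designed to temper.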
Two qualitatively different statistical distances have recently positioned themselves as the front runners for data-driven decision-making and are now briefly discussed. Optimal transport distances [10, 12] have received a lot of attention both in the context of data-driven decision-making as well as in the machine learning community at large [15, 18]. In the context of data-driven decision-making optimal transport distances have become very popular after [10] pointed out that their resulting robust formulations need not be intractable. Furthermore, the associated optimal transport ambiguity sets enjoy an interpretation as confidence intervals [11] for the unknown probability measure P when the radii r_N are judiciously chosen as a function of the number N of observed data points.

Perhaps the main competitor to optimal transport distances is the Kullback-Leibler or entropic divergence. We briefly recall that the entropic divergence between two measures μ and ν on the same space is defined as

D_KL(μ, ν) = ∫ log( dμ/dν ) dμ  if μ ≪ ν,  +∞ otherwise.

The entropic divergence is a particular member of the class of convex f-divergences which are well known [14] to yield tractable robust formulations. Interestingly, its associated ambiguity sets do not generally admit an interpretation as a confidence interval for the unknown probability measure. Unless the event set Ξ is finite, the associated entropic ambiguity set {P ∈ P : D_KL(P_N, P) ≤ r_N} does indeed not contain any continuous probability measure and hence also not necessarily P. Despite this observation, the interval

[ min_{P ∈ A_N(P_N)} E_P[ℓ(z, ξ)] , max_{P ∈ A_N(P_N)} E_P[ℓ(z, ξ)] ]

nevertheless admits an interpretation [8] as a confidence interval for the unknown cost E_P[ℓ(z, ξ)] for any decision z ∈ Z when the radii r_N are judiciously chosen. Perhaps even more surprisingly, the associated entropic robust prescriptive formulation

z_KL(P_N) ∈ arg_ε inf_{z ∈ Z} sup { E_P[ℓ(z, ξ)] : P ∈ P, D_KL(P_N, P) ≤ r }   (4)

can be shown [27] to enjoy optimal large deviation properties of a similar nature to those we will encounter in Section 5.

The previous discussion naturally begs the question when, if ever, optimal transport distances enjoy similar statistical guarantees in the context of prescriptive problems. Folklore belief suggests that optimal transport methods derive their superior empirical performance from their ability to guard against noisy or corrupted data. For instance, optimal transport methods can be interpreted as maximum likelihood estimation [21] in the context of predictive problems. This note indicates that optimal transport methods are similarly well suited for prescriptive problems facing noisy data. In that sense this note hopes to offer a theoretical justification of the perhaps surprising effectiveness and popularity of optimal transport methods for data-driven decision-making. We also show that any perceived dichotomy between entropic and optimal transport distances in the context of data-driven decision-making is in fact a false one and that a balance of both distances is better than either distance separately. Hence, we argue that entropic and optimal transport distance formulations should be perceived as complementary rather than as direct competitors for data-driven decision-making.
Organization

In Section 2 we briefly recall the (entropic) optimal transport distance and its properties. We introduce our noisy data model and provide three illustrative examples in Section 3. In Section 4 we prove that the empirical distribution of noisy observational data satisfies a large deviation principle with a rate function which balances entropic divergence and entropic optimal transport distances. Finally, Section 5 illustrates the power of optimal transport distances in the context of both hypothesis testing and data-driven decision-making.
Topology
We will assume that Ξ and Ξ′ are Polish topological spaces and hence so is the product space Ξ × Ξ′ when equipped with the product topology. We denote with P(Ξ), P(Ξ′) and P(Ξ × Ξ′) the sets of all Borel probability measures on the spaces Ξ, Ξ′ and Ξ × Ξ′, respectively. Following [7, Section 6.2] the probability simplices P(Ξ), P(Ξ′) and P(Ξ × Ξ′), when equipped with the topology of weak convergence of probability measures, are Polish spaces too. We denote with D_M′ : P(Ξ′) × P(Ξ′) → R_+ a metric compatible with the weak topology on P(Ξ′); a classical choice is to take D_M′ : P(Ξ′) × P(Ξ′) → [0, 1] as the Lévy-Prokhorov [19] metric on P(Ξ′). Finally, we take Z to be a finite dimensional linear vector space equipped with its classical norm topology.

2 Optimal Transport Distances
Given two measures μ and ν we will denote with μ ⊗ ν their product measure. Conversely, given a probability measure T in P(Ξ × Ξ′) we denote its marginal projections on P(Ξ) and P(Ξ′) as Π_Ξ T and Π_Ξ′ T, respectively. We define

T(μ, ν) := { T ∈ P(Ξ × Ξ′) : Π_Ξ T = μ, Π_Ξ′ T = ν }

as the set of all joint probability measures with given marginal distributions μ and ν. Measures in the set T(μ, ν) can be interpreted to transport marginal μ to marginal ν. Furthermore, this set is nonempty as clearly we have μ ⊗ ν ∈ T(μ, ν).

Definition 2.1 (Entropic Optimal Transport Distance). Given a distance function d : Ξ × Ξ′ → R ∪ {+∞}, the entropic optimal transport distance between μ on Ξ and ν on Ξ′ is defined as

D_W(μ, ν) := inf_{T ∈ T(μ, ν)} ∫ d(ξ, ξ′) dT(ξ, ξ′) + D_KL(T, Π_Ξ T ⊗ Π_Ξ′ T).

Note that we do not explicitly require the spaces Ξ and Ξ′ to coincide here. Assume for a moment however that Ξ = Ξ′ = R^n and d(ξ′, ξ) = ‖ξ′ − ξ‖. Then, the entropic optimal transport distance, modulo the regularization term D_KL(T, Π_Ξ T ⊗ Π_Ξ′ T), coincides with the classical Wasserstein distance between probability measures [22]. We remark that the entropic optimal transport distance is not a metric although it still enjoys distance-like properties [5]. Historically, the entropy term D_KL(T, Π_Ξ T ⊗ Π_Ξ′ T) has been considered primarily for its beneficial smoothing effect as indeed it allows the entropic optimal transport distance to be computed efficiently using the Sinkhorn algorithm [5] in case the event sets Ξ = Ξ′ are finite. As for any T ∈ T(μ, ν) we have D_KL(T, Π_Ξ T ⊗ Π_Ξ′ T) = D_KL(T, μ ⊗ ν), the regularization term can be interpreted to encourage transportation plans which are not too different from the independent product coupling μ ⊗ ν.
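For illustration, on finite event sets the entropic optimal transport distance of Definition 2.1 can be evaluated with Sinkhorn iterations [5]; the sketch below exploits that the optimal plan takes the scaled form u_i K_{ij} v_j with K the Gibbs kernel relative to the product coupling. The function name and the toy data are ours.

```python
import numpy as np

def entropic_ot(mu, nu, d, n_iter=500):
    """D_W(mu, nu) = min_{T in T(mu, nu)} <d, T> + D_KL(T, mu x nu) on finite spaces,
    computed with Sinkhorn iterations (regularization strength 1, as in Definition 2.1)."""
    K = np.outer(mu, nu) * np.exp(-d)         # Gibbs kernel relative to the product coupling
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iter):                   # alternating marginal (Sinkhorn) scalings
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    T = u[:, None] * K * v[None, :]           # (near-)optimal transport plan
    with np.errstate(divide="ignore", invalid="ignore"):
        kl = np.where(T > 0, T * np.log(T / np.outer(mu, nu)), 0.0).sum()
    return float((d * T).sum() + kl)

# Toy example on a common three-point support (distances chosen for illustration only).
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
d = np.abs(np.subtract.outer(np.arange(3.0), np.arange(3.0)))
print(entropic_ot(mu, nu, d))
```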
3 Noisy Data

In deconvolution problems the learner attempts to estimate P on the basis of noisy observations, i.e.,

ξ′_i ∼ O_{ξ_i} independently  ∀ i ∈ [1, ..., N],   (5)

where ξ_1, ..., ξ_N are independent and identically distributed according to P. Hence, rather than having direct access to samples ξ_i ∈ Ξ from the probability measure of interest, the learner must make do with indirect noisy data ξ′_i ∈ Ξ′ instead. We do assume here that the observational map O is known. Furthermore, for some of our results the observational process will be required to be continuous in the sense of Assumption 3.1. We remark that this assumption is without much loss of generality. In particular, if Ξ and Ξ′ are finite, Assumption 3.1 is trivially satisfied.

Assumption 3.1. The map O : Ξ → P(Ξ′) is absolutely continuous with respect to a base measure m′, i.e., O_ξ ≪ m′ for all ξ ∈ Ξ. Consequently, there exists a density function d : Ξ × Ξ′ → R ∪ {+∞} so that

dO_ξ/dm′(ξ′) = exp(−d(ξ, ξ′))  ∀ ξ′ ∈ Ξ′.

The relationship between the probability measure P′ of the noisy observations ξ′_i and the probability measure P of the unobserved noiseless data ξ_i can be characterized as the convolution

P′(B) = (O ⋆ P)(B) := ∫ O_ξ(B) dP(ξ)

for all measurable sets B ∈ B(Ξ′). Clearly, the unknown probability measure P is identifiable from its noisy counterpart P′ only if this convolution transformation is invertible. We will denote with P′ := { O ⋆ P ∈ P(Ξ′) : P ∈ P } the family of potential distributions of our noisy data. We conclude this section by pointing out that the presented observational model is quite flexible and captures a wide variety of interesting settings.

Example 3.2 (Noiseless Data). The choice O_ξ = δ_ξ for all ξ ∈ Ξ can be identified with a noiseless observation regime, where δ_ξ denotes the Dirac measure at ξ, i.e., δ_ξ(B) = 1{ξ ∈ B} for all B ∈ B(Ξ′). Here, the probability measure P′ of the observations ξ′_1, ..., ξ′_N coincides with the unknown probability measure P. That is,

P′(B) = (O ⋆ P)(B) := ∫ 1{ξ ∈ B} dP(ξ) = P(B)

for all measurable sets B ∈ B(Ξ′). This setting corresponds to the observational setting described by Equation (1) in which data sampled from the unknown probability measure is observed directly, uncorrupted by any noise.

Example 3.3 (Irrelevant Data). The case O_ξ = P′ for some probability measure P′ in P(Ξ′) represents a setting in which the data is irrelevant. Here the observations are independent of the unknown probability measure P and consequently wholly irrelevant. Indeed, the distribution of the noisy data is independent of P and given as

P′(B) = (O ⋆ P)(B) := ∫ P′(B) dP(ξ) = P′(B)

for all measurable subsets B ∈ B(Ξ′). Clearly, under these circumstances the unknown probability measure P is simply not identifiable from the observed data unless P is a singleton, in which case the learning problem is trivial.

Example 3.4 (Gaussian Noise). Most practical examples are situated somewhere between the previously discussed corner cases. For the sake of exposition we will consider the case of Gaussian noise as a final example. Let Assumption 3.1 hold here with Ξ = Ξ′ = R^n, d(ξ, ξ′) = ‖ξ − ξ′‖²/(2σ²) with noise power σ² > 0, and m′ = μ′/(σ√(2π))^n with μ′ the Lebesgue measure on R^n. Consequently, here O_ξ = N(ξ, σ²I), a normal distribution with mean ξ and variance σ². The noisy data ξ′_i in Equation (5) follows the same distribution as

ξ_i + z_i  ∀ i ∈ [1, ..., N],

where ξ_1, ..., ξ_N and z_1, ..., z_N are independent and identically distributed as P and N(0, σ²I), respectively. This class of noisy observations interpolates between the noiseless regime in Example 3.2 when σ → 0 and the irrelevant data regime in Example 3.3 when σ → ∞.
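A minimal simulation of the observational model (5) under Example 3.4 for illustration only; the exponential latent law, the noise level and the seed are placeholders of our choosing.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the observational model (5) under Example 3.4 (Gaussian noise).
def noisy_sample(N, sigma):
    xi = rng.exponential(scale=1.0, size=N)          # latent xi_1, ..., xi_N ~ P (unobserved)
    z = rng.normal(loc=0.0, scale=sigma, size=N)     # noise z_i ~ N(0, sigma^2)
    return xi + z                                    # observed xi'_i = xi_i + z_i

# The learner only ever sees the empirical measure P'_N of the noisy data.
xi_prime = noisy_sample(N=1000, sigma=0.5)
print(xi_prime.mean(), np.histogram(xi_prime, bins=5)[0] / xi_prime.size)
```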
4 A Large Deviation Principle

We will attempt to infer the unknown probability measure P from our noisy data based on its empirical probability measure P′_N = Σ_{i=1}^N δ_{ξ′_i}/N. Clearly, considering the empirical probability measure rather than the noisy data directly imposes no loss of information as the order of the data points is of no consequence here. Our main observation will be that this sufficient statistic P′_N satisfies a large deviation principle [7], with a rate function which carefully balances an entropic optimal transport distance and an entropic divergence.

Theorem 4.1.
Let Assumption 3.1 hold. Then, the sufficient statistic P′_N satisfies for any open subset O ⊆ P(Ξ′) the large deviation lower bound

−inf_{P′ ∈ O} I(P′, P) ≤ liminf_{N→∞} (1/N) log Prob_P[ P′_N ∈ O ]   (6a)

and for any closed subset C ⊆ P(Ξ′) the large deviation upper bound

limsup_{N→∞} (1/N) log Prob_P[ P′_N ∈ C ] ≤ −inf_{P′ ∈ C} I(P′, P)   (6b)

for the good and convex rate function

I(P′, P) = inf_{Q ∈ P(Ξ)} D_W(P′, Q) + D_KL(Q, P) + D_KL(P′, m′) ≥ 0.   (7)

Here a rate function I is called good if its sublevel sets { P′ ∈ P(Ξ′) : I(P′, P) ≤ r } are compact for any r ≥ 0 and P ∈ P [7].

Proof.
Consider first the statistic T′_N = Σ_{i=1}^N δ_{(ξ_i, ξ′_i)}/N. We first show that this statistic satisfies a large deviation property [7, Section 1.2]. Second, as we have that P′_N = Π_Ξ′ T′_N, the large deviation inequalities (6a) and (6b) for P′_N can then be established via a contraction principle [7, Theorem 4.2.1].

Let T(P) be the joint distribution of the random variables (ξ_i, ξ′_i) and note that under Assumption 3.1 we have

T(P)(B) := ∫ 1{(ξ, ξ′) ∈ B} exp(−d(ξ, ξ′)) dm′(ξ′) dP(ξ)  ∀ B ∈ B(Ξ × Ξ′).   (8)

Equivalently, dT(P)/d(P ⊗ m′)(ξ, ξ′) = exp(−d(ξ, ξ′)). Clearly, the noisy observational model in Equation (5) guarantees that each of the samples in the sequence {(ξ_i, ξ′_i)} is independent and identically distributed. An empirical distribution T′_N of independent and identically distributed samples following distribution T(P) enjoys a large deviation property with rate function D_KL(T′, T(P)) [7, Theorem 6.2.10]. Assume that T′ ≪ T(P), as otherwise D_KL(T′, T(P)) = +∞. Under Assumption 3.1 the condition T′ ≪ T(P) is furthermore equivalent to Π_Ξ T′ ≪ P and Π_Ξ′ T′ ≪ m′. We have under this assumption that

D_KL(T′, T(P)) := ∫ log( dT′/dT(P) (ξ, ξ′) ) dT′(ξ, ξ′)
= ∫ log( dT′/d(P ⊗ m′) (ξ, ξ′) ) dT′(ξ, ξ′) − ∫ log( dT(P)/d(P ⊗ m′) (ξ, ξ′) ) dT′(ξ, ξ′)
= D_KL(T′, P ⊗ m′) − ∫ log( exp(−d(ξ, ξ′)) ) dT′(ξ, ξ′)
= D_KL(T′, P ⊗ m′) + ∫ d(ξ, ξ′) dT′(ξ, ξ′)
= ∫ d(ξ, ξ′) dT′(ξ, ξ′) + D_KL(T′, Π_Ξ T′ ⊗ Π_Ξ′ T′) + D_KL(Π_Ξ T′, P) + D_KL(Π_Ξ′ T′, m′) ≥ 0,

where the last equality uses the chain rule of the relative entropy.

Since P′_N = Π_Ξ′ T′_N, a large deviation property for P′_N can now be established via a contraction principle [7, Theorem 4.2.1] as the projection operator Π_Ξ′ : P(Ξ × Ξ′) → P(Ξ′) is continuous. For any sequence T_k ∈ P(Ξ × Ξ′) with limit T̄ ∈ P(Ξ × Ξ′) we have by definition of the weak topology that

∫ c(ξ, ξ′) dT_k(ξ, ξ′) → ∫ c(ξ, ξ′) dT̄(ξ, ξ′)

for all bounded and continuous functions c : Ξ × Ξ′ → R. Consequently, for any bounded and continuous function c′ : Ξ′ → R we have that

∫ c′(ξ′) dΠ_Ξ′ T_k(ξ′) = ∫ c′(ξ′) dT_k(ξ, ξ′) → ∫ c′(ξ′) dT̄(ξ, ξ′) = ∫ c′(ξ′) dΠ_Ξ′ T̄(ξ′),

where we use that also (ξ, ξ′) ↦ c′(ξ′) is bounded and continuous as a map from Ξ × Ξ′ to R. The marginal projection Π_Ξ′ T is hence indeed continuous as we have Π_Ξ′ T_k → Π_Ξ′ T̄ for any converging sequence T_k ∈ P(Ξ × Ξ′) with limit T̄ ∈ P(Ξ × Ξ′). Consequently, via a contraction principle [7, Theorem 4.2.1], the large deviation inequalities (6a) and (6b) of the statistic P′_N can be established for the rate function

I(P′, P) = inf { D_KL(T′, T(P)) : T′ s.t. Π_Ξ′ T′ = P′ }
= inf { D_KL(T′, T(P)) : Q ∈ P(Ξ), T′ ≪ T(P) s.t. Π_Ξ T′ = Q, Π_Ξ′ T′ = P′ }
= inf { ∫ d(ξ, ξ′) dT′(ξ, ξ′) + D_KL(T′, Π_Ξ T′ ⊗ Π_Ξ′ T′) + D_KL(Q, P) + D_KL(P′, m′) : Q ∈ P(Ξ), T′ ∈ T(Q, P′) }
= inf_{Q ∈ P(Ξ)} D_W(P′, Q) + D_KL(Q, P) + D_KL(P′, m′).
The joint convexity of the rate function I(P′, P) = inf { D_KL(T′, T(P)) : T′ s.t. Π_Ξ′ T′ = P′ } is finally inherited [17, Theorem 1.31] from the joint convexity of the objective function D_KL(T′, T) in (T′, T), the linearity of T(P) in P evident from Equation (8), and the linearity of the projection Π_Ξ′ T′ in T′.

We remark that large deviation inequalities are generally quite rough in nature as indeed (6a) and (6b) only pertain to open or closed sets, respectively. Theorem 4.1 states that the rate function I(P′, P) is always nonnegative; the fact that I(P′, P) = 0 if and only if P′ = O ⋆ P can easily be deduced from its proof and the observation that D_KL(T′, T) = 0 ⟺ T′ = T.
For any ε > 0, the large deviation inequality (6b), despite its rough nature, nevertheless implies

limsup_{N→∞} (1/N) log Prob_P[ D_M′(P′_N, O ⋆ P) ≥ ε ] ≤ −min { I(P′, P) : P′ ∈ P(Ξ′), D_M′(P′, O ⋆ P) ≥ ε } < 0  ∀ P ∈ P,

where the minimum is indeed achieved as our good rate function has compact sublevel sets and the set of all P′ ∈ P(Ξ′) such that D_M′(P′, O ⋆ P) ≥ ε is by definition closed and does not contain the distribution O ⋆ P. Hence, our large deviation property immediately implies that the empirical probability measure P′_N converges in probability to P′ = O ⋆ P with an increasing number of observations. In fact, the rate function can be interpreted as the appropriate yardstick with which to measure how fast this convergence takes place.
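As an illustrative sanity check, consider the noiseless regime of Example 3.2 on a finite event set with m′ taken to be the counting measure (our choice of normalization), so that Assumption 3.1 holds with exp(−d(ξ, ξ′)) = 1{ξ = ξ′}. The short derivation below recovers the classical Sanov rate and is consistent with the discussion in Remark 5.8.

```latex
% Noiseless special case (Example 3.2, finite event set, counting base measure m'):
% T(P) in (8) concentrates on the diagonal with weights P, so any T' << T(P) with
% second marginal P' must be the diagonal coupling with weights P', and therefore
I(P', P) \;=\; \inf\bigl\{\, D_{\mathrm{KL}}(T', T(P)) \,:\, \Pi_{\Xi'} T' = P' \,\bigr\}
        \;=\; \sum_{\xi \in \Xi} P'(\xi)\,\log\frac{P'(\xi)}{P(\xi)}
        \;=\; D_{\mathrm{KL}}(P', P).
```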
5 Optimal Statistical Properties

In this section we present two distinct statistical problems in which optimal transport distances become appropriate in the face of noisy observational data. That is, we indicate that our rate function I induces optimal statistical properties in two distinct problem settings. A hypothesis testing problem class will serve to illustrate the optimality of our considered optimal transport distance in a first predictive setting. This example reinforces the findings of Rigollet and Weed [21] in so far that an optimal transport distance is shown to be sensible in a predictive context. The optimality of ambiguity sets associated with the rate function I is also established in a second prescriptive setting. Hence, we illustrate by means of example that optimal transport distances enjoy optimal statistical guarantees in both predictive and prescriptive settings in the face of noisy observational data.

Hypothesis testing

We consider a hypothesis testing problem in which we need to determine whether the sequence of noisy data points ξ′_1, ..., ξ′_N defined in Equation (5) is produced by the unobserved probability measure P_0 or alternatively by P_1. Hypothesis testing problems can be regarded as simple prediction problems in which P = {P_0, P_1}.

Definition 5.1 (Hypothesis test). A hypothesis test h̃ is a measurable function h̃ : P(Ξ′) → {P_0, P_1}. We denote with R(h̃) = { P′ ∈ P(Ξ′) : h̃(P′) = P_1 } its rejection region.

A hypothesis test h̃ has as interpretation that for data with empirical distribution P′_N, the null hypothesis (P_0 generated the data) is accepted if h̃(P′_N) = P_0, while its alternative is accepted (P_1 generated the data) if h̃(P′_N) = P_1. We associate with each hypothesis test its asymptotic exponential error rates

limsup_{N→∞} (1/N) log Prob_{P_0}[ h̃(P′_N) = P_1 ] = limsup_{N→∞} (1/N) log Prob_{P_0}[ P′_N ∈ R(h̃) ],   (9)

limsup_{N→∞} (1/N) log Prob_{P_1}[ h̃(P′_N) = P_0 ] = limsup_{N→∞} (1/N) log Prob_{P_1}[ P′_N ∉ R(h̃) ].   (10)
Clearly, the statistical performance of a hypothesis test should be based on how well it can balance the desire to keep both the first and second error probabilities small. It is quite common to only consider hypothesis tests which suffer type I errors at rate at most −r, i.e., hypothesis tests for which (9) ≤ −r. Such hypothesis tests guarantee that the null hypothesis is not erroneously rejected all that often. Given this requirement, we would now like to find hypothesis tests which additionally enjoy an optimal type II error rate, i.e., hypothesis tests for which the probabilities Prob_{P_1}[ P′_N ∉ R(h̃) ] decay exponentially as fast as possible.
To that end let the δ-smoothed rate function be defined as

I_δ(P′, P) := inf { I(P′′, P) : P′′ ∈ P(Ξ′), D_M′(P′′, P′) ≤ δ }.   (11)

Fix a radius r > 0 and consider the family of hypothesis tests h̃_δ such that for all δ > 0

h̃_δ(P′_N) = P_0 if I_δ(P′_N, P_0) ≤ r, and P_1 otherwise.

The proposed family of hypothesis tests can be shown to be almost asymptotically optimal using merely a large deviations argument. Due to the rough nature of the large deviation property, we will consider, in the same spirit as [7, Section 7.1], for any hypothesis test h̃ also its ε-open inflated rejection regions, ε > 0,

R_ε(h̃) = { P′′ ∈ P(Ξ′) : ∃ P′ ∈ R(h̃), D_M′(P′′, P′) < ε }.

Theorem 5.2.
The family of hypothesis tests h̃_δ satisfies for any δ > 0 the type I error bound

limsup_{N→∞} (1/N) log Prob_{P_0}[ P′_N ∈ R(h̃_δ) ] ≤ −r.   (12)

For any other hypothesis test h̃ that satisfies a type I error bound with

limsup_{N→∞} (1/N) log Prob_{P_0}[ P′_N ∈ R_ε(h̃) ] ≤ −r   (13)

for some ε > 0, there exists furthermore a δ′ > 0 (independent from P_1) so that for all 0 < δ ≤ δ′ we have

liminf_{N→∞} (1/N) log Prob_{P_1}[ P′_N ∉ R(h̃) ] ≥ liminf_{N→∞} (1/N) log Prob_{P_1}[ P′_N ∉ R(h̃_δ) ].   (14)
Proof. We start by proving that our family of formulations is feasible and satisfies inequality (12). Note that

R(h̃_δ) = { P′ ∈ P(Ξ′) : I_δ(P′, P_0) > r }.

We may assume without loss of generality that R(h̃_δ) ≠ ∅, for otherwise the error probability Prob_{P_0}[P′_N ∈ R(h̃_δ)] = 0 for N ≥ 1 and inequality (12) holds trivially. Next we show that P′ ∈ cl R(h̃_δ) ⟹ I(P′, P_0) > r. For the sake of contradiction assume that we have found P′ ∈ cl R(h̃_δ) for which I(P′, P_0) ≤ r. There must exist now a sequence P′_k ∈ R(h̃_δ) which converges to P′ and hence D_M′(P′_k, P′) tends to zero. However, from the definition of the smooth rate function I_δ we have that in fact for all Q′ ∈ P(Ξ′) such that D_M′(Q′, P′_k) ≤ δ we have that I(Q′, P_0) > r. Take now k large enough such that D_M′(P′_k, P′) = D_M′(P′, P′_k) ≤ δ; then we must have I(P′, P_0) > r, a contradiction. The above reasoning implies, using the large deviation inequality (6b), that

limsup_{N→∞} (1/N) log Prob_{P_0}[ P′_N ∈ R(h̃_δ) ] ≤ −inf_{P′ ∈ cl R(h̃_δ)} I(P′, P_0) ≤ −r,

establishing inequality (12).

We will now prove that for any hypothesis test h̃ which satisfies inequality (13) for some ε > 0 there exists δ′ > 0 such that

R(h̃) ⊆ R(h̃_δ′),   (15)

from which inequality (14) follows immediately as we note that indeed R(h̃_δ′) ⊆ R(h̃_δ) for all 0 < δ ≤ δ′. It remains to prove inequality (15). Assume for the sake of contradiction that Q′′(δ) ∈ R(h̃) and Q′′(δ) ∉ R(h̃_δ) for all δ > 0.
By definition of the smooth rate function I_δ stated in Equation (11) and the hypothesis test h̃_δ we can find an auxiliary sequence Q⋆(δ) ∈ P(Ξ′) so that D_M′(Q⋆(δ), Q′′(δ)) ≤ δ and I(Q⋆(δ), P_0) ≤ r. As the rate function I is good there exists a sequence δ_k ↓ 0 such that Q⋆(δ_k) converges to some Q⋆ ∈ cl R(h̃) ⊆ R_ε(h̃) with I(Q⋆, P_0) ≤ r. Define now the continuous function

Q′ : [0, 1] → P(Ξ′),  λ ↦ λ · O ⋆ P_0 + (1 − λ) · Q⋆

and recall that I(O ⋆ P_0, P_0) = 0.
From convexity of the rate function I and the fact that R_ε(h̃) is an open set containing R(h̃), there must exist λ′ ∈ (0, 1] sufficiently small so that with Q′ = Q′(λ′) we have, using the secant inequality, I(Q′, P_0) ≤ λ′ I(O ⋆ P_0, P_0) + (1 − λ′) I(Q⋆, P_0) ≤ (1 − λ′) r = r′ < r and Q′ ∈ R_ε(h̃). From the large deviation inequality (6a) we have

−r′ ≤ −I(Q′, P_0) ≤ −inf_{P′ ∈ R_ε(h̃)} I(P′, P_0) ≤ liminf_{N→∞} (1/N) log Prob_{P_0}[ P′_N ∈ R_ε(h̃) ],

directly contradicting inequality (13) as we have established that r′ < r.

Inequality (12) guarantees that our proposed family of hypothesis tests does not make type I errors all that often. That is, the probability of making a type I error decays to zero exponentially fast at rate at least −r. Inequalities (13) and (14) state that our proposed family almost dominates any other hypothesis test. That is, any other hypothesis test which enjoys a type I error guarantee which is stronger by even the smallest amount ε > 0 can not improve upon the type II error rate of our tests h̃_δ for any 0 < δ ≤ δ′ with 0 < δ′ sufficiently small. Furthermore, remark that our family of tests does not depend on the alternative hypothesis P_1 in any way. It is hence a universally optimal family in that among all feasible tests it suffers an essentially minimal type II error probability rate whatever the alternative hypothesis P_1 may be.

Remark 5.3.
In view of the previous discussion it would perhaps be tempting to consider the hypothesis test h̃ such that

h̃(P′_N) = P_0 if I(P′_N, P_0) ≤ r, and P_1 otherwise.

However, remark that when the base measure m′ defined in Assumption 3.1 fails to be atomic, then I(P′_N, P_0) = +∞ for the (atomic) empirical probability distribution P′_N of our noisy data. Consequently, the null hypothesis is never accepted by the test h̃, which is clearly undesirable as we have under such circumstances

Prob_{P_0}[ P′_N ∈ R(h̃) ] = 1  ∀ N ≥ 1.

Hence, some degree of smoothing of the rate function as done in Equation (11) seems unavoidable.
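On finite event sets Assumption 3.1 holds trivially and no smoothing is required, so the test h̃_δ can be evaluated by solving the exponential-cone program behind the rate function directly. The sketch below does so with cvxpy's rel_entr atom; the observation kernel, the two hypotheses, the radius, the solver choice and all helper names are ours and purely illustrative.

```python
import numpy as np
import cvxpy as cp

def rate_function(P_prime, P, O):
    """I(P', P) = inf { D_KL(T', T(P)) : Pi_{Xi'} T' = P' } on a finite space, cf. the proof
    of Theorem 4.1, with T(P)_{ij} = P_i O_{ij} and O_{ij} = exp(-d(xi_i, xi'_j)) m'_j."""
    T = cp.Variable(O.shape, nonneg=True)
    problem = cp.Problem(cp.Minimize(cp.sum(cp.rel_entr(T, P[:, None] * O))),
                         [cp.sum(T, axis=0) == P_prime])   # second marginal pinned to P'
    problem.solve()
    return problem.value

def test(P_prime_N, P0, O, r):
    """The test of Section 5: accept the null P0 iff I(P'_N, P0) <= r (no smoothing needed here)."""
    return "P0" if rate_function(P_prime_N, P0, O) <= r else "P1"

O = np.array([[0.80, 0.15, 0.05],   # toy observation kernel O(xi' | xi)
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
P0 = np.array([0.6, 0.3, 0.1])      # null hypothesis
P1 = np.array([0.2, 0.3, 0.5])      # alternative hypothesis
rng = np.random.default_rng(2)
xi = rng.choice(3, size=200, p=P0)                         # latent noiseless samples
xi_prime = np.array([rng.choice(3, p=O[i]) for i in xi])   # noisy observations as in (5)
P_prime_N = np.bincount(xi_prime, minlength=3) / 200
print(test(P_prime_N, P0, O, r=0.05))                      # accepts P0 in a typical run
```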
Data-driven decision-making

Consider a prescriptive problem in which we attempt to learn the solution to the stochastic optimization problem stated in Equation (2) from the noisy observational data defined in Equation (5). Let us denote with P^ml_N the maximum likelihood estimate proposed by [21] for the unobserved probability distribution P. A straightforward extension of the sample average formulation in Equation (3) to this noisy data would be to consider

z(P^ml_N) ∈ arg_ε inf_{z ∈ Z} E_{P^ml_N}[ℓ(z, ξ)].   (16)

Many other formulations based on different distributional estimates are evidently possible as well. The kernel deconvolution estimate P^kd_N proposed in [25] is one such alternative and yields yet another data-driven formulation

z(P^kd_N) ∈ arg_ε inf_{z ∈ Z} E_{P^kd_N}[ℓ(z, ξ)].   (17)

This naturally leads to the question whether, between these two data-driven formulations, one ought to be preferred over the other from a statistical point of view. To answer this question more broadly we must of course first define precisely what constitutes a data-driven formulation and secondly agree on how its statistical performance should be quantified. We follow the framework presented in [27] and define a data-driven formulation as consisting of a predictor and a prescriptor.

Definition 5.4 (Data-driven predictors and prescriptors). A measurable function c̃ : Z × P(Ξ′) → R is called a data-driven predictor. A measurable function z̃ : P(Ξ′) → Z is called a data-driven prescriptor if there exists a data-driven predictor c̃ that induces z̃ in the sense that z̃(P′) ∈ arg_ε inf_{z ∈ Z} c̃(z, P′) for all P′ ∈ P(Ξ′). That is, we have c̃(z̃(P′), P′) − ε < ṽ(P′) := inf_{z ∈ Z} c̃(z, P′), where we denote the function ṽ : P(Ξ′) → R as the optimal value function of the formulation.

The maximum likelihood formulation (16) and the kernel deconvolution formulation (17) employ the cost predictors E_{P^ml_N}[ℓ(z, ξ)] and E_{P^kd_N}[ℓ(z, ξ)] to prescribe their decisions z(P^ml_N) and z(P^kd_N), respectively. However, both the maximum likelihood and the kernel deconvolution formulation are based on a point estimate for the unobserved probability distribution P and consequently can be expected to suffer similar shortcomings as the sample average formulation. That is, the cost budgeted for the prescribed decision is likely to disappoint out of sample. Here we say a formulation based on a predictor prescriptor pair (c̃, z̃) disappoints if the event

P′_N ∈ D(c̃, z̃; P) := { P′ ∈ P(Ξ′) : c(z̃(P′), P) > c̃(z̃(P′), P′) }

occurs, with c(z, P) = E_P[ℓ(z, ξ)]. Such disappointment events, in which the actual cost of our decision c(z̃(P′_N), P) exceeds the predicted cost or budget c̃(z̃(P′_N), P′_N), may result in severe financial consequences and should be avoided by the decision-maker. Consequently, we would prefer formulations which keep the disappointment rates

limsup_{N→∞} (1/N) log Prob_P[ P′_N ∈ D(c̃, z̃; P) ]   (18)

as small as possible for all P ∈ P. Evidently, this can be achieved trivially by simply inflating the cost budgeted for each decision by some large nonnegative amount ρ > 0.
Indeed, the disappointment probabilities

Prob_P[ P′_N ∈ D(c̃ + ρ, z̃; P) ] = Prob_P[ c(z̃(P′_N), P) > c̃(z̃(P′_N), P′_N) + ρ ]   (19)

can be made arbitrarily small by choosing a large enough additive inflation constant ρ. However, data-driven formulations with overly conservative cost estimates are clearly undesirable as the ultimate budgeted cost for their optimal decision would compare unfavorably to that of a competitor using more aggressive pricing and should hence also be avoided. We will here only call formulations feasible if their out-of-sample disappointment probability decays sufficiently fast, i.e., (18) ≤ −r. Naturally, we would prefer formulations which promise minimal long term costs c̃(z̃(O ⋆ P), O ⋆ P) for all P ∈ P as indeed we have that the observed random cost c̃(z̃(P′_N), P′_N) converges almost surely to c̃(z̃(O ⋆ P), O ⋆ P) as the empirical distribution P′_N converges almost surely to O ⋆ P following [9, Theorem 11.4.1] for every P ∈ P.

We consider therefore the family of robust formulations utilizing the predictor prescriptor pairs

c̃_δ(z, P′_N) := sup { E_P[ℓ(z, ξ)] : P ∈ P, I_δ(P′_N, P) ≤ r },   z̃_δ(P′_N) ∈ arg_ε inf_{z ∈ Z} c̃_δ(z, P′_N)   (20)

based on our smooth large deviation rate function defined in Equation (11). We will show using a large deviation argument that this family dominates the very rich class of formulations based on regular predictors and prescriptors.

Definition 5.5 (Regular predictors and prescriptors). A data-driven predictor c̃ : Z × P(Ξ′) → R is called regular if it is continuous on Z × P(Ξ′). A data-driven prescriptor z̃ : P(Ξ′) → Z is called regular if it is continuous and there exists a regular data-driven predictor c̃ that induces z̃ in the sense that z̃(P′) ∈ arg_ε inf_{z ∈ Z} c̃(z, P′) for all P′ ∈ P(Ξ′).

Remark that the class of all predictor prescriptor pairs is very rich as Definition 5.5 imposes only mild structural restrictions. The Berge maximum theorem [3, p. 116] indeed implies that the optimal value function ṽ(P′) = min_{z ∈ Z} c̃(z, P′) of any regular formulation is a continuous function on P(Ξ′) already when the constraint set Z is merely compact. The correspondence P′ ↦ { z ∈ Z : c̃(z, P′) < ṽ(P′) + ε } of ε-suboptimal solutions in a regular formulation is consequently lower semicontinuous [2, Corollary 4.2.4.1] for any ε > 0.
Hence, for formulations employing a convex predictor c̃ and P(Ξ′) a compact set, an associated regular prescriptor can always be found [1, Theorem 9.1]. Should a regular formulation admit unique optimal decisions, such decisions will constitute a regular prescriptor as well following [3, p. 117]. The restriction to this nevertheless quite rich class of regular formulations is necessary due to the rough nature of the employed large deviation argument.

Assumption 5.6.
The cost function c : Z × P → R, (z, P) ↦ E_P[ℓ(z, ξ)], is continuous.

Theorem 5.7.
Let Assumption 5.6 hold. Then, the family of predictor prescriptor pairs (c̃_δ, z̃_δ) is feasible for any δ > 0, i.e.,

limsup_{N→∞} (1/N) log Prob_P[ P′_N ∈ D(c̃_δ, z̃_δ; P) ] ≤ −r  ∀ P ∈ P.   (21)

Furthermore, for any regular predictor prescriptor pair (c̃, z̃) that satisfies

limsup_{N→∞} (1/N) log Prob_P[ P′_N ∈ D(c̃, z̃; P) ] ≤ −r  ∀ P ∈ P,   (22)

we have that for all ε > 0 there exists δ′ > 0 so that for all 0 < δ ≤ δ′ we have

c̃(z̃(O ⋆ P), O ⋆ P) + 2ε ≥ c̃_δ(z̃_δ(O ⋆ P), O ⋆ P)  ∀ P ∈ P.   (23)

Proof.
We start by proving that our family of formulations is feasible and satisfies inequality (21). To this end define the sets

D_δ(P) = { P′ ∈ P(Ξ′) : sup_{z ∈ Z} c(z, P) − c̃_δ(z, P′) > 0 }  and  B_δ(P) = { P′ ∈ P(Ξ′) : I_δ(P′, P) > r }.
We may assume without loss of generality that D_δ(P) ≠ ∅, for otherwise the out-of-sample disappointment Prob_P[ P′_N ∈ D(c̃_δ, z̃_δ; P) ] clearly vanishes for all N ≥ 1 as D(c̃_δ, z̃_δ; P) ⊆ D_δ(P). We first establish that D_δ(P) ⊆ B_δ(P). For the sake of contradiction, choose any P′ ∈ D_δ(P), and assume that I_δ(P′, P) ≤ r. Thus, we have for some z ∈ Z that

c(z, P) > c̃_δ(z, P′) = sup { c(z, Q) : Q ∈ P, I_δ(P′, Q) ≤ r } ≥ c(z, P);

a contradiction. As P′ ∈ D_δ(P) was chosen arbitrarily, we have thus shown that D_δ(P) ⊆ B_δ(P) and hence also cl D_δ(P) ⊆ cl B_δ(P). Next we show that P′ ∈ cl B_δ(P) ⟹ I(P′, P) > r. For the sake of contradiction assume that we have found P′ ∈ cl B_δ(P) for which I(P′, P) ≤ r. There must exist a sequence P′_k ∈ B_δ(P) which converges to P′ and hence D_M′(P′_k, P′) tends to zero. However, from the definition of B_δ(P) we have that in fact for all Q′ ∈ P(Ξ′) such that D_M′(Q′, P′_k) ≤ δ we have that I(Q′, P) > r. Take now k large enough such that D_M′(P′_k, P′) = D_M′(P′, P′_k) ≤ δ; then we must have I(P′, P) > r, a contradiction. The above reasoning implies that

limsup_{N→∞} (1/N) log Prob_P[ c(z̃_δ(P′_N), P) > c̃_δ(z̃_δ(P′_N), P′_N) ] ≤ limsup_{N→∞} (1/N) log Prob_P[ sup_{z ∈ Z} c(z, P) − c̃_δ(z, P′_N) > 0 ] ≤ −inf_{P′ ∈ cl B_δ(P)} I(P′, P) ≤ −r,

establishing inequality (21) as P ∈ P was arbitrary.

We will now prove that for any regular predictor prescriptor pair (c̃, z̃) which satisfies inequality (22) we have

lim_{δ→0} inf_{P′ ∈ P′} c̃(z̃(P′), P′) − c̃_δ(z̃(P′), P′) ≥ 0.   (24)

From inequality (24) we can take δ′ > 0 sufficiently small so that c̃(z̃(P′), P′) ≥ c̃_δ′(z̃(P′), P′) − ε ≥ c̃_δ′(z̃_δ′(P′), P′) − 2ε uniformly for all P′ ∈ P′ = { O ⋆ P : P ∈ P }. Remark that we clearly have c̃_δ(z, P′) ≤ c̃_δ′(z, P′) for all z ∈ Z, P′ ∈ P(Ξ′) and 0 < δ ≤ δ′, from which inequality (23) follows immediately.

We will now establish inequality (24) by showing that

lim_{δ→0} inf_{P′ ∈ P′} c̃(z̃(P′), P′) − c̃_δ(z̃(P′), P′) ≥ −2ρ   (25)

for any ρ > 0.
Assume for the sake of contradiction that lim_{δ→0} inf_{P′ ∈ P′} c̃(z̃(P′), P′) − c̃_δ(z̃(P′), P′) < −2ρ. There must hence exist a δ′ > 0 such that for all δ ≤ δ′ we have inf_{P′ ∈ P′} c̃(z̃(P′), P′) − c̃_δ(z̃(P′), P′) < −2ρ. Consequently, there exists a distribution Q′′(δ) ∈ P′ such that

c̃(z̃(Q′′(δ)), Q′′(δ)) + 2ρ < c̃_δ(z̃(Q′′(δ)), Q′′(δ)) = sup { c(z̃(Q′′(δ)), P) : P ∈ P, I_δ(Q′′(δ), P) ≤ r }  ∀ δ ≤ δ′.

Hence, there exists for all δ ≤ δ′ a distribution P⋆(δ) ∈ P such that I_δ(Q′′(δ), P⋆(δ)) ≤ r and c̃(z̃(Q′′(δ)), Q′′(δ)) + 2ρ < c(z̃(Q′′(δ)), P⋆(δ)). From the definition of the smooth rate function I_δ stated in Equation (11) this implies that there exists an auxiliary sequence Q′′′(δ) ∈ P(Ξ′) such that I(Q′′′(δ), P⋆(δ)) ≤ r with D_M′(Q′′′(δ), Q′′(δ)) ≤ δ for all δ ≤ δ′. Remark that

lim_{δ→0} c̃(z̃(Q′′′(δ)), Q′′′(δ)) = lim_{δ→0} c̃(z̃(Q′′(δ)), Q′′(δ))  and  lim_{δ→0} c(z̃(Q′′′(δ)), P⋆(δ)) = lim_{δ→0} c(z̃(Q′′(δ)), P⋆(δ)),

as the functions c : Z × P → R, c̃ : Z × P(Ξ′) → R and z̃ : P(Ξ′) → Z are continuous and D_M′(Q′′′(δ), Q′′(δ)) ≤ δ implies lim_{δ→0} Q′′′(δ) = lim_{δ→0} Q′′(δ). Consequently, there exists a δ⋆ ∈ (0, δ′] so that with P⋆ = P⋆(δ⋆) ∈ P and Q⋆ = Q′′′(δ⋆) ∈ P(Ξ′) we have both

| c̃(z̃(Q′′(δ⋆)), Q′′(δ⋆)) − c̃(z̃(Q⋆), Q⋆) | < ρ  and  | c(z̃(Q′′(δ⋆)), P⋆) − c(z̃(Q⋆), P⋆) | < ρ.

Hence, we have both c̃(z̃(Q⋆), Q⋆) < c(z̃(Q⋆), P⋆) and I(Q⋆, P⋆) ≤ r. Define now the continuous function Q′ : [0, 1] → P(Ξ′), λ ↦ λ · O ⋆ P⋆ + (1 − λ) · Q⋆ and recall that I(O ⋆ P⋆, P⋆) = 0. As the functions c : Z × P → R, c̃ : Z × P(Ξ′) → R and z̃ : P(Ξ′) → Z are continuous, there hence exists Q′ = Q′(λ′) for λ′ ∈ (0, 1] sufficiently small so that I(Q′, P⋆) ≤ λ′ I(O ⋆ P⋆, P⋆) + (1 − λ′) I(Q⋆, P⋆) ≤ (1 − λ′) r = r′ < r, using the convexity of the rate function, and c̃(z̃(Q′), Q′) < c(z̃(Q′), P⋆). Consequently, we have that Q′ ∈ D(c̃, z̃; P⋆) = int D(c̃, z̃; P⋆) as, we remark for a final time, the functions c : Z × P → R, c̃ : Z × P(Ξ′) → R and z̃ : P(Ξ′) → Z are continuous. From the large deviation inequality (6a) it follows now that

−r′ ≤ −I(Q′, P⋆) ≤ −inf_{P′ ∈ int D(c̃, z̃; P⋆)} I(P′, P⋆) ≤ liminf_{N→∞} (1/N) log Prob_{P⋆}[ P′_N ∈ D(c̃, z̃; P⋆) ],

which is in direct contradiction with the feasibility inequality (22) as r′ < r. This establishes inequality (25) as ρ > 0 was arbitrary, and hence inequality (23) holds for all 0 < δ ≤ δ′ with 0 < δ′ sufficiently small.

The previous theorem hence indicates that our family dominates any regular formulation in terms of balancing the desire for small out-of-sample disappointment as well as minimal budgeted costs under Assumption 5.6. Finally, we remark that Assumption 5.6 is rather mild and is already satisfied when the loss function ℓ : Z × Ξ → R is merely bounded and uniformly continuous.
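For illustration, on a finite event set the robust predictor in (20) amounts to maximizing a linear functional of P over the jointly convex constraint set induced by the (here unsmoothed) rate function, which is again an exponential-cone program. The sketch below takes the prior family P to be the whole simplex; the observation kernel, the loss vector for one fixed candidate decision, the data, the radius and all names are ours and purely illustrative.

```python
import numpy as np
import cvxpy as cp

# Finite-space sketch of the robust predictor (20): the worst-case expected loss over all
# noiseless distributions P whose rate function I(P'_N, P) stays within the budget r.
O = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])      # observation kernel exp(-d) m' on a 3-point space
loss_z = np.array([1.0, 3.0, 8.0])      # loss(z, xi_i) for one fixed candidate decision z
P_prime_N = np.array([0.5, 0.3, 0.2])   # empirical distribution of the noisy data
r = 0.05                                # radius of the ambiguity set

P = cp.Variable(3, nonneg=True)         # candidate noiseless distribution
T = cp.Variable((3, 3), nonneg=True)    # coupling T' with second marginal P'_N
TP = cp.multiply(O, cp.reshape(P, (3, 1)) @ np.ones((1, 3)))   # T(P)_{ij} = P_i O_{ij}, cf. (8)
constraints = [cp.sum(P) == 1,
               cp.sum(T, axis=0) == P_prime_N,                 # Pi_{Xi'} T' = P'_N
               cp.sum(cp.rel_entr(T, TP)) <= r]                # D_KL(T', T(P)) <= r
worst_case = cp.Problem(cp.Maximize(loss_z @ P), constraints)
worst_case.solve()
print("robust cost prediction for this decision:", worst_case.value)
```

Sweeping this evaluation over candidate decisions z then yields the prescriptor z̃_δ of (20) by picking a minimizer of the resulting worst-case predictions.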
Remark 5.8.
In view of the previous discussion it is again tempting to consider the data-driven formulation with predictor and prescriptor

c̃(z, P′_N) := sup { E_P[ℓ(z, ξ)] : P ∈ P, I(P′_N, P) ≤ r },   z̃(P′_N) ∈ arg min_{z ∈ Z} c̃(z, P′_N)   (26)

based directly on our rate function I rather than its smooth counterpart I_δ. Van Parys et al. [27] prove indeed that when given access to the noiseless data in Equation (1) with empirical distribution P_N the appropriate rate function is precisely the entropic distance D_KL(P_N, P) and formulation (4) enjoys optimal statistical properties very similar to those found in Theorem 5.7. However, recall again that for noisy data when the base measure m′ defined in Assumption 3.1 fails to be atomic, the ambiguity set { P ∈ P : I(P′_N, P) < ∞ } = ∅ is trivial and consequently the associated data-driven formulation is here infeasible. Hence, considering the smooth rate function I_δ instead of the rate function I directly seems unavoidable when faced with noisy observational data.

References
[1] J.-P. Aubin and H. Frankowska. Set-Valued Analysis. Springer Science & Business Media, 2009.
[2] B. Bank, J. Guddat, D. Klatte, B. Kummer, and K. Tammer. Non-Linear Parametric Optimization. Springer, 1982.
[3] C. Berge. Topological Spaces: Including a Treatment of Multi-Valued Functions, Vector Spaces, and Convexity. Courier Corporation, 1997.
[4] D. Bertsimas, V. Gupta, and N. Kallus. Data-driven robust optimization. Mathematical Programming, 167(2):235–292, 2018.
[5] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26:2292–2300, 2013.
[6] E. Delage and Y. Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.
[7] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications, volume 38. Springer Science & Business Media, 2009.
[8] J. C. Duchi, P. W. Glynn, and H. Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 2021.
[9] R. M. Dudley. Real Analysis and Probability. CRC Press, 2018.
[10] P. M. Esfahani and D. Kuhn. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115–166, 2018.
[11] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3):707–738, 2015.
[12] R. Gao and A. J. Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. arXiv preprint arXiv:1604.02199, 2016.
[13] P. Groeneboom and J. A. Wellner. Information Bounds and Nonparametric Maximum Likelihood Estimation, volume 19. Springer Science & Business Media, 1992.
[14] Z. Hu and L. J. Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013.
[15] D. Kuhn, P. M. Esfahani, V. A. Nguyen, and S. Shafieezadeh-Abadeh. Wasserstein distributionally robust optimization: Theory and applications in machine learning. In Operations Research & Management Science in the Age of Analytics, pages 130–166. INFORMS, 2019.
[16] R. O. Michaud. The Markowitz optimization enigma: Is "optimized" optimal? Financial Analysts Journal, 45(1):31–42, 1989.
[17] T. Pennanen. Introduction to Convex Optimization. 2019. URL https://nms.kcl.ac.uk/teemu.pennanen/co-new.pdf.
[18] G. Peyré and M. Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.
[19] Y. V. Prokhorov. Convergence of random processes and limit theorems in probability theory. Theory of Probability & Its Applications, 1(2):157–214, 1956.
[20] H. Rahimian and S. Mehrotra. Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659, 2019.
[21] P. Rigollet and J. Weed. Entropic optimal transport is maximum-likelihood deconvolution. Comptes Rendus Mathematique, 356(11-12):1228–1235, 2018.
[22] F. Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, NY, 2015.
[23] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.
[24] B. W. Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.
[25] L. A. Stefanski and R. J. Carroll. Deconvolving kernel density estimators. Statistics, 21(2):169–184, 1990.
[26] B. P. Van Parys, P. J. Goulart, and D. Kuhn. Generalized Gauss inequalities via semidefinite programming. Mathematical Programming, 156(1-2):271–302, 2016.
[27] B. P. Van Parys, P. M. Esfahani, and D. Kuhn. From data to decisions: Distributionally robust optimization is optimal. Management Science, 2020. doi: 10.1287/mnsc.2020.3678.
[28] W. Wiesemann, D. Kuhn, and M. Sim. Distributionally robust convex optimization. Operations Research, 62(6):1358–1376, 2014.