Wasserstein Distributionally Robust Stochastic Control: A Data-Driven Approach
Insoon Yang∗

Abstract
Standard stochastic control methods assume that the probability distribution of uncertain variables is available. Unfortunately, in practice, obtaining accurate distribution information is a challenging task. To resolve this issue, we investigate the problem of designing a control policy that is robust against errors in the empirical distribution obtained from data. This problem can be formulated as a two-player zero-sum dynamic game problem, where the action space of the adversarial player is a Wasserstein ball centered at the empirical distribution. We propose computationally tractable value and policy iteration algorithms with explicit estimates of the number of iterations required for constructing an ε-optimal policy. We show that the contraction property of associated Bellman operators extends a single-stage out-of-sample performance guarantee, obtained using a measure concentration inequality, to the corresponding multi-stage guarantee without any degradation in the confidence level. In addition, we characterize an explicit form of the optimal distributionally robust control policy and the worst-case distribution policy for linear-quadratic problems with Wasserstein penalty. Our study indicates that dynamic programming and Kantorovich duality play a critical role in solving and analyzing Wasserstein distributionally robust stochastic control problems.
1 Introduction

The theory of stochastic optimal control is based on the assumption that the probability distribution of uncertain variables (e.g., disturbances) is fully known. However, this assumption is often restrictive in practice, because estimating an accurate distribution requires large-scale high-resolution sensor measurements over a long training period or multiple periods. Situations in which uncertain variables are not directly observed are much more challenging; computational methods, such as filtering or statistical learning techniques, are often used to obtain the (posterior) distribution of the uncertain variables given limited observations. The accuracy of the obtained distribution is often unsatisfactory, as it is subject to the quality of the collected data, the computational methods, and the prior knowledge regarding the variables. If poor distributional information is employed in constructing a stochastic optimal controller, it does not guarantee optimality and can even cause catastrophic system behaviors (e.g., [1, 2]).

To overcome this issue of limited distribution information in stochastic control, we investigate a distributionally robust control approach. This emerging minimax stochastic control method minimizes a cost function of interest, assuming that the distribution of uncertain variables is not completely known, but is contained in a pre-specified ambiguity set of probability distributions. In this paper, we model the ambiguity set as a statistical ball centered at an empirical distribution with a radius measured by the Wasserstein metric. This modeling approach provides a straightforward means to incorporate data samples into distributionally robust control problems. Our focus is to show that the resulting stochastic control problems have several salient features in terms of computational tractability and out-of-sample performance guarantees.

Due to its superior statistical properties, the Wasserstein ambiguity set has recently received a great deal of attention in distributionally robust optimization (e.g., [3, 4, 5, 6]), learning (e.g., [7, 8]) and filtering [9]. Specifically, the Wasserstein ball contains both continuous and discrete distributions, whereas statistical balls with a φ-divergence such as the Kullback-Leibler divergence, when centered at a discrete empirical distribution, are not sufficiently rich to contain relevant continuous distributions. Furthermore, the Wasserstein metric takes into account the closeness between two points in the support, unlike the φ-divergence. Because the φ-divergence is incapable of accounting for the distance between two support elements, the associated ambiguity set may contain irrelevant distributions [5]. For these reasons, we choose the Wasserstein metric to handle distribution ambiguity, although several other types of ambiguity sets have been proposed in the context of single-stage optimization by using moment constraints (e.g., [10, 11, 12]), confidence sets (e.g., [13]), and φ-divergences (e.g., [14, 15]).

∗Department of Electrical and Computer Engineering, Automation and Systems Research Institute, Seoul National University ([email protected]). Supported in part by NSF under ECCS-1708906 and CNS-1657100, the Research Resettlement Fund for the new faculty of Seoul National University (SNU), the Creative-Pioneering Researchers Program through SNU, the Basic Research Lab Program through the National Research Foundation of Korea funded by the MSIT (2018R1A4A1059976), and Samsung Electronics.
1.1 Related Work

Distributionally robust sequential decision-making problems have been studied in the context of finite Markov decision processes (MDPs) and continuous-state stochastic control. In the finite MDP setting, dynamic programming (DP) approaches have been proposed [16, 17, 18]. In [16], moment-based ambiguity sets are used to impose constraints on the moments of distributions, such as the mean and covariance. This approach is further extended to handle more types of constraints, such as confidence sets and mean absolute deviation [17], by using the lifting technique given in [13]. Distributionally robust MDPs with Wasserstein balls are studied in [18], which provides computationally tractable reformulations and useful analytical properties.

Continuous-state distributionally robust control problems can be considered as a class of minimax stochastic control on Borel spaces [19]. In the case of linear dynamics and quadratic cost functions, [20] focuses on linear policies and proposes a tractable semidefinite program formulation when moment constraints are imposed. A DP method is also proposed for moment-based ambiguity sets and applied to probabilistic safety specification problems [21]. On the other hand, [22] uses a total variation ball to model distribution ambiguity and proposes a modified version of the classical policy iteration algorithm. Furthermore, a Riccati equation-based approach is also developed in the linear-quadratic regulator setting with the total variation ambiguity set [23] and the relative entropy constraint [24].
1.2 Contributions

Departing from the aforementioned control approaches that indirectly use data samples, we consider continuous-state distributionally robust control problems with Wasserstein ambiguity sets and develop a dynamic programming method to solve and analyze these problems by directly using the data. The following is a summary of the main contributions of this work. First, we propose computationally tractable value and policy iteration algorithms with explicit estimates of the number of iterations necessary for obtaining an ε-optimal policy. The original Bellman equation involves an infinite-dimensional minimax optimization problem, where the inner maximization problem is over probability measures in the Wasserstein ball. To alleviate the computational issue without sacrificing optimality, we reformulate the Bellman operators by using modern distributionally robust optimization (DRO) based on Kantorovich duality [3, 5]. Second, we show that the resulting distributionally robust policy π⋆ has a probabilistic out-of-sample performance guarantee, which we obtain by using the contraction property of associated Bellman operators and a measure concentration inequality. In other words, when π⋆ is used, a probabilistic bound holds on the closed-loop performance evaluated under a new set of samples that are selected independently of the training data. We observe that the contraction property of the Bellman operator seamlessly connects a single-stage performance guarantee to its multi-stage counterpart in a manner that is independent of the number of stages. Third, we consider a Wasserstein penalty problem and derive an explicit expression of the optimal control policy and the worst-case distribution policy, along with a Riccati-type equation in the linear-quadratic setting. We also show that the resulting control policy converges to the optimal policy of the corresponding linear-quadratic-Gaussian (LQG) problem as the penalty parameter tends to +∞. The performance and utility of the proposed method are demonstrated through an investment-consumption problem and a power system frequency control problem.

This paper significantly extends its preliminary version [25], which models distribution ambiguity by using confidence sets. Specifically, we consider Wasserstein ambiguity sets and investigate new salient features of the corresponding distributionally robust control framework, such as (i) a characterization of the worst-case distribution policy, (ii) an out-of-sample performance guarantee, and (iii) an explicit expression of the solution to linear-quadratic problems.

1.3 Organization

In Section 2, we define optimal distributionally robust policies under ambiguous uncertainty and formulate the corresponding distributionally robust stochastic control problem as a dynamic game. In Section 3, we develop a tractable semi-infinite program formulation of the Bellman equation and characterize one of the worst-case distribution policies by using Kantorovich duality. In Section 4, we examine a probabilistic out-of-sample performance guarantee of the distributionally robust policy. In Section 5, we present the Wasserstein penalty problem and its explicit solution obtained from a Riccati-type equation. Finally, in Section 6, we provide the results of our numerical experiments.
2 Problem Formulation

Notation. Given a Borel space X, we let P(X) denote the set of Borel probability measures on X. In addition, B_ξ(X) denotes the Banach space of measurable functions v on X with a finite weighted sup-norm, i.e., ‖v‖_ξ := sup_{x∈X} (|v(x)|/ξ(x)) < ∞, given a measurable weight function ξ : X → R. Let B_lsc(X) be the set of lower semicontinuous functions in B_ξ(X).

Consider a discrete-time stochastic system of the form

    x_{t+1} = f(x_t, u_t, w_t),    (2.1)

where x_t ∈ X ⊆ R^n and u_t ∈ U ⊆ R^m denote the system state and control input, respectively. Here, w_t ∈ W ⊆ R^l is a random disturbance whose probability distribution is denoted by µ_t. In practice, however, this distribution is not fully known and is difficult to estimate accurately. We assume that X, U and W are Borel subsets of R^n, R^m and R^l, respectively.

Suppose that the w_t's are i.i.d. and that we have access to a sample {ŵ^(1), ..., ŵ^(N)} of w_t. One of the most straightforward approaches is to use the sample average approximation (SAA) method and solve the corresponding optimal control problem with the empirical distribution. This SAA-control problem can be formulated as

    (SAA-control)  inf_{π∈Π} E^π_{w_t∼ν_N} [ Σ_{t=0}^∞ α^t c(x_t, u_t) | x_0 = x ],    (2.2)

where ν_N denotes the empirical distribution constructed from the N samples:

    ν_N := (1/N) Σ_{i=1}^N δ_{ŵ^(i)}    (2.3)

with the Dirac measure δ_{ŵ^(i)} concentrated at ŵ^(i).
Here, α ∈ (0, 1) is a discount factor, c : X × U → R is a stage-wise cost function of interest, and E^π_{w_t∼ν_N} denotes the expected value taken with respect to the probability measure induced by the control policy π and the empirical distribution ν_N. As the number of samples, N, tends to infinity, the empirical distribution ν_N well approximates the true distribution µ; thus, an optimal policy of the SAA-control problem presents a near-optimal performance.

Unfortunately, it takes a long simulation period or multiple episodes to obtain a large number of samples. Furthermore, in practice, it is likely that the sample data do not reflect the true distribution, due to inaccurate sensor measurements or data corruption by malicious attackers (e.g., hackers). To resolve these issues in data-driven stochastic control, we propose an optimization method to construct a policy that is robust against errors in the empirical distribution (2.3). More specifically, our policy minimizes the worst-case total cost that is calculated under a probability distribution contained in a given set D ⊂ P(W), which is called the ambiguity set of probability distributions. The ambiguity set can be designed to adequately characterize errors in the empirical distribution.
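The SAA objective (2.2) can be approximated by Monte Carlo rollouts in which disturbances are drawn from the empirical distribution (2.3). The following minimal Python sketch illustrates this; the dynamics f, stage cost c, policy, and all numeric values are hypothetical placeholders supplied by the user, not objects defined in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def saa_cost(f, c, policy, w_hat, x0, alpha=0.95, horizon=200, n_rollouts=1000):
    """Monte Carlo estimate of the SAA objective (2.2): disturbances are
    drawn from the empirical distribution nu_N, i.e., uniformly from the
    observed samples w_hat = [w^(1), ..., w^(N)]."""
    total = 0.0
    for _ in range(n_rollouts):
        x, cost = x0, 0.0
        for t in range(horizon):          # truncate the infinite horizon
            u = policy(x)
            cost += alpha**t * c(x, u)
            w = w_hat[rng.integers(len(w_hat))]   # sample from nu_N
            x = f(x, u, w)
        total += cost
    return total / n_rollouts
```

The truncation horizon should be chosen so that α^horizon is negligible relative to the desired accuracy.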
To formulate a concrete distributionally robust control problem, we consider a Markov (or stochastic) game with complete information (e.g., [26, 19]), which is a class of two-player zero-sum dynamic games: Player I (the controller) determines a policy to minimize the total cost, while Player II (the adversary) selects the disturbance distribution µ_t of w_t from the ambiguity set D to maximize the same cost value. Let H_t be the set of histories up to stage t, whose element is of the form h_t := (x_0, u_0, ..., x_{t−1}, u_{t−1}, x_t). The set of admissible control strategies (for Player I) is given by Π := {π := (π_0, π_1, ...) | π_t(U(x_t) | h_t) = 1 ∀h_t ∈ H_t}, where π_t is a stochastic kernel from H_t to R^m and U(x_t) ⊆ U is the set of admissible control actions (given that the system state is x_t at stage t). Similarly, the set of Player II's admissible strategies is defined by Γ := {γ := (γ_0, γ_1, ...) | γ_t(D | h^e_t) = 1 ∀h^e_t ∈ H^e_t}, where H^e_t is the set of extended histories up to stage t, whose element is of the form h^e_t := (x_0, u_0, µ_0, ..., x_{t−1}, u_{t−1}, µ_{t−1}, x_t, u_t), and γ_t is a stochastic kernel from H^e_t to P(W).¹ Note that the ambiguity set D is the action space of Player II. Here, we allow Player II to change the distribution of w_t over time. Thus, the strategy space for Player II is larger than necessary, and this gives an advantage to the adversary. However, we will later show that an optimal policy of Player II is stationary under some assumptions (see Proposition 5).

We consider the following infinite-horizon discounted cost function:

    J_x(π, γ) := E^{π,γ} [ Σ_{t=0}^∞ α^t c(x_t, u_t) | x_0 = x ],    (2.4)

where E^{π,γ} denotes the expectation with respect to the probability measure induced by the strategy pair (π, γ) ∈ Π × Γ.

Before defining a concrete stochastic control problem, we impose the following standard assumption for measurable selection in semicontinuous models [19]:

¹All the results in this paper are valid with histories of the form h̃_t := (x_0, u_0, w_0, µ_0, ..., x_{t−1}, u_{t−1}, w_{t−1}, µ_{t−1}, x_t) that also contain Player II's actions (µ_0, ..., µ_{t−1}); this is because, under Assumption 1, it suffices without loss of optimality to focus on stationary policies that depend only on the current state information. We intentionally use the reduced version of histories, as the realized distributions may not be observable in practice.
Assumption 1.
Let K := {(x, u) ∈ X × U | u ∈ U(x)}.

1. The function c is lower semicontinuous on K, and |c(x, u)| ≤ bξ(x) for all (x, u) ∈ K, for some constant b ≥ 0 and continuous function ξ : X → [1, ∞) such that ξ′(x, u) := ∫_W ξ(f(x, u, w)) µ(dw) is continuous on K for any µ ∈ D. In addition, there exists a constant β ∈ [1, 1/α) such that ξ′(x, u) ≤ βξ(x) for all (x, u) ∈ K;

2. For each continuous bounded function χ : X → R, the function χ′(x, u) := ∫_W χ(f(x, u, w)) µ(dw) is continuous on K for any µ ∈ D;

3. The set U(x) is compact for every x ∈ X, and the set-valued mapping x ↦ U(x) is upper semicontinuous.

The first condition trivially holds when c is bounded. In fact, ξ is a weight function introduced to relax the boundedness assumption. Assumption 1 ensures the existence of an optimal policy π⋆, which is deterministic and stationary, of a minimax control problem with the cost function (2.6) [19, Theorem 4.1]. Furthermore, the corresponding optimal value function lies in B_lsc(X), as discussed later.

We now define optimal distributionally robust policies as follows:

Definition 1.
A control policy π⋆ ∈ Π is said to be an optimal distributionally robust policy if it satisfies

    sup_{γ∈Γ} J_x(π⋆, γ) ≤ sup_{γ′∈Γ} J_x(π, γ′)  ∀π ∈ Π.    (2.5)

In words, an optimal distributionally robust policy achieves the minimal cost under the most adverse policies that select disturbance distributions in the ambiguity set D. Such a desirable policy can be obtained by solving the following problem:

    (DR-control)  inf_{π∈Π} sup_{γ∈Γ} J_x(π, γ),    (2.6)

which we call the distributionally robust control (DR-control) problem. The existence of an optimal policy under Assumption 1 will be formalized in Theorem 1 in Section 3.1.

The most important part of this formulation is the inner maximization problem over all disturbance distribution policies in Γ, which encodes distributional uncertainty through D. An optimal policy π⋆ has a performance guarantee in the form of an upper bound, sup_{γ∈Γ} J_x(π⋆, γ), if the ambiguity set is sufficiently large to contain the true distribution. This performance guarantee may not be valid when a different control policy is used, as shown in (2.5).

To complete the formulation of the DR-control problem, we consider a specific class of ambiguity sets using the Wasserstein metric. Let D be a statistical ball centered at the empirical distribution ν_N defined by (2.3) with radius θ > 0:

    D := {µ ∈ P(W) | W_p(µ, ν_N) ≤ θ}.    (2.7)

Here, the distance between two probability distributions is measured by the Wasserstein metric of order p ∈ [1, ∞),

    W_p(µ, ν_N) := min_{κ∈P(W×W)} { [ ∫_{W×W} d(w, w′)^p κ(dw, dw′) ]^{1/p} | Π¹κ = µ, Π²κ = ν_N },    (2.8)

where d is a metric on W, and Π^i κ denotes the ith marginal of κ for i = 1, 2.
The Wasserstein distance between two probability distributions represents the minimum cost of transporting or redistributing mass from one to the other via non-uniform perturbation, and the optimization variable κ can be interpreted as a transport plan.

The minimization problem that identifies an optimal transport plan κ in (2.8) is called the Monge-Kantorovich problem. The minimum of this problem can be found by solving the following dual problem:

    W_p(µ, ν_N)^p = sup_{(φ,ψ)∈Φ} [ ∫_W φ(w) µ(dw) + ∫_W ψ(w′) ν_N(dw′) ],

where Φ := {(φ, ψ) ∈ L¹(dµ) × L¹(dν_N) | φ(w) + ψ(w′) ≤ d(w, w′)^p ∀w, w′ ∈ W}. This equivalence is known as the Kantorovich duality principle. The Wasserstein ball (2.7) can then be expressed as follows:
Lemma 1.
The Wasserstein ambiguity set defined by (2.7) is equivalent to

    D = { µ ∈ P(W) | ∫_W φ(w) µ(dw) + (1/N) Σ_{i=1}^N inf_{w∈W} [ d(w, ŵ^(i))^p − φ(w) ] ≤ θ^p  ∀φ ∈ L¹(dµ) }.

A proof of this lemma is contained in Appendix A. Note that the minimization problem in the reformulated Wasserstein ball is finite-dimensional, unlike the original Monge-Kantorovich problem. In the following section, we propose computationally tractable value and policy iteration algorithms by using reformulation results from DRO based on Kantorovich duality.
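When both distributions are finitely supported, the Monge-Kantorovich problem (2.8) reduces to a finite-dimensional linear program and can be solved exactly. The following Python sketch does so with scipy's LP solver; equal-weight empirical distributions and the Euclidean metric d are assumed for concreteness.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_p(xs, ys, p=1):
    """Order-p Wasserstein distance between two equal-weight empirical
    distributions, via the Monge-Kantorovich LP (2.8).
    xs, ys: 2-D arrays of support points, shapes (n, dim) and (m, dim)."""
    n, m = len(xs), len(ys)
    # transport cost d(w, w')^p for every pair of support points
    cost = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=2) ** p
    # marginal constraints: each row sums to 1/n, each column to 1/m
    A_eq, b_eq = [], []
    for i in range(n):
        row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
        A_eq.append(row); b_eq.append(1.0 / n)
    for j in range(m):
        col = np.zeros(n * m); col[j::m] = 1.0
        A_eq.append(col); b_eq.append(1.0 / m)
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None))
    return res.fun ** (1.0 / p)   # transport plan: res.x.reshape(n, m)
```

For example, `wasserstein_p(np.zeros((1, 1)), np.ones((1, 1)))` returns 1.0, the cost of moving a unit point mass from 0 to 1.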
3 Dynamic Programming Solution

Our first goal is to develop a computationally tractable dynamic programming (DP) solution for the DR-control problem (2.6). We begin by characterizing an optimality condition using the Bellman principle of optimality.
3.1 Bellman Equation and Value Iteration

For any v ∈ B_ξ(X), let T be the Bellman operator of the DR-control problem (2.6), defined by

    (Tv)(x) := inf_{u∈U(x)} sup_{µ∈D} [ c(x, u) + α ∫_W v(f(x, u, w)) µ(dw) ]

for every x ∈ X. Assumption 1 enables us to conduct the contraction analysis with respect to the weighted sup-norm ‖·‖_ξ defined by

    ‖v‖_ξ := sup_{x∈X} |v(x)| / ξ(x).

The second and third conditions in Assumption 1 play a critical role in preserving the lower semicontinuity of the value function when applying the Bellman operator, as well as in the existence and optimality of deterministic stationary policies. Let Π_DS be the set of deterministic stationary policies, i.e., Π_DS := {π : X → U | π(x_t) = u_t ∈ U(x_t), π measurable}. Then, the following lemmas hold:

Lemma 2 (Contraction and Monotonicity). Suppose that Assumption 1 holds. Then,
Tv ∈ B_lsc(X) for any v ∈ B_lsc(X). Furthermore, the Bellman operator T : B_lsc(X) → B_lsc(X) is a τ-contraction mapping with respect to ‖·‖_ξ, where τ := αβ ∈ (0, 1), i.e.,

    ‖Tv − Tv′‖_ξ ≤ τ‖v − v′‖_ξ  ∀v, v′ ∈ B_lsc(X).

Furthermore, T is monotone, i.e., Tv ≤ Tv′ for all v, v′ ∈ B_ξ(X) such that v ≤ v′.

Lemma 3 (Measurable selection). Suppose that Assumption 1 holds. There exist a measurable function v⋆ ∈ B_lsc(X) and a deterministic stationary policy π⋆ ∈ Π_DS such that

1. v⋆ is the unique function in B_lsc(X) that satisfies the following Bellman equation:

    v = Tv;    (3.1)
2. given any fixed x ∈ X,

    v⋆(x) = sup_{µ∈D} [ c(x, π⋆(x)) + α ∫_W v⋆(f(x, π⋆(x), w)) µ(dw) ]

and lim_{t→∞} α^t E^{π,γ}[v⋆(x_t)] = 0 for all (π, γ) ∈ Π × Γ.

These lemmas follow immediately from [19, Lemma 4.4 and Theorem 4.1]. In fact, for any v ∈ B_lsc(X), there exists û ∈ U(x) such that (Tv)(x) = sup_{µ∈D} [c(x, û) + α ∫_W v(f(x, û, w)) µ(dw)] for every x ∈ X under Assumption 1 (see [19, Lemma 3.3]). If we let π⋆(x) := û for each x ∈ X, then π⋆ is an optimal distributionally robust policy, which is deterministic and stationary. More specifically, the following principle of optimality holds:

Theorem 1 (Existence and optimality of deterministic stationary policies). Suppose that Assumption 1 holds. Then, (v⋆, π⋆) ∈ B_lsc(X) × Π_DS defined in Lemma 3 satisfies

    v⋆(x) = inf_{π∈Π} sup_{γ∈Γ} J_x(π, γ) = sup_{γ∈Γ} J_x(π⋆, γ)  ∀x ∈ X.

In words, v⋆ is the optimal value function of the DR-control problem (2.6), and π⋆ is an optimal policy, which is deterministic and stationary.

The existence and optimality results are shown in a more general minimax control setting in [19, Theorem 4.1]. Here, the constant β ∈ [1, 1/α) is defined in Assumption 1-1). Thus, the outer minimization problem in the definition of T admits an optimal solution when v ∈ B_lsc(X), and "inf" can be replaced by "min."

To compute the optimal value function v⋆, we first consider a value iteration (VI) approach,

    v_{k+1} := T v_k,

where v_k denotes the value function evaluated at the kth iteration and v_0 is initialized as an arbitrary function in B_lsc(X). By the contraction property of T (Lemma 2), the Banach fixed-point theorem implies that v_k converges to v⋆ pointwise as k tends to ∞ under Assumption 1. However, this approach requires us to solve the infinite-dimensional minimax optimization problem in the Bellman operator for each x ∈ X in each iteration. To alleviate this issue, we reformulate the problem into a computationally tractable form by using modern Wasserstein DRO [3, 5].

Proposition 1.
Suppose that the function w ↦ v(f(x, u, w)) lies in L¹(dν_N) for each (x, u) ∈ K. Then, the Bellman operator T can be expressed as

    (Tv)(x) = inf_{u,λ,ℓ} [ λθ^p + c(x, u) + (1/N) Σ_{i=1}^N ℓ_i ]
    s.t.  αv(f(x, u, w)) − λ d(w, ŵ^(i))^p ≤ ℓ_i  ∀w ∈ W,
          u ∈ U(x), λ ≥ 0, ℓ ∈ R^N    (3.2)

for each x ∈ X, where the first inequality constraint holds for all i = 1, ..., N.

This reformulation can be obtained by applying Kantorovich duality to the Wasserstein ambiguity set (Lemma 1). It is shown in [5, Theorem 1] that there is no duality gap.

Note that the reformulated optimization problem in Proposition 1 has finite-dimensional decision variables, as u ∈ U(x) ⊆ U ⊆ R^m, λ ∈ R and ℓ ∈ R^N. However, the first inequality constraint must hold for all w in the support W, which could be a dense set. Thus, in general, the reformulated problem is a semi-infinite program. This semi-infinite program can be solved by using several existing convergent algorithms, such as discretization and sampling-based methods (see [27, 28, 29, 30] and the references therein).

To interpret this reformulation, we consider the following equivalent integral form:

    (Tv)(x) = inf_{u∈U(x), λ≥0} [ λθ^p + ∫_W sup_{w∈W} [ c(x, u) + αv(f(x, u, w)) − λ d(w, w′)^p ] ν_N(dw′) ].

The integrand above can be interpreted as a regularized cost-to-go function. The regularized value is then integrated using the empirical distribution ν_N. The first term λθ^p, which is nonnegative, is added to compensate for this regularization effect and the optimism induced by the empirical distribution, so that the reformulated optimization problem is consistent with the original one.
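To make the reformulation concrete, the following Python sketch approximately evaluates (Tv)(x) from (3.2) by the simplest of the convergent schemes mentioned above: the semi-infinite constraint is enforced on a finite grid covering W, and the outer minimization over (u, λ) is done by grid search. Scalar disturbances with d(w, w′) = |w − w′| are assumed; f, c, v, and the grids are user-supplied placeholders.

```python
import numpy as np

def bellman_dual(x, v, f, c, w_hat, u_grid, w_grid, lam_grid,
                 alpha=0.95, theta=0.1, p=1):
    """Approximate (T v)(x) via the dual reformulation (3.2).
    w_grid discretizes the support W; lam_grid contains nonnegative
    candidate values of the dual variable lambda."""
    best_val, best_u = np.inf, None
    for u in u_grid:
        # alpha * v(f(x, u, w)) evaluated on the disturbance grid
        vals = np.array([alpha * v(f(x, u, w)) for w in w_grid])
        for lam in lam_grid:
            # l_i = sup_w [alpha v(f(x,u,w)) - lam * d(w, w_hat_i)^p]
            l = [np.max(vals - lam * np.abs(w_grid - wi) ** p) for wi in w_hat]
            val = lam * theta**p + c(x, u) + np.mean(l)
            if val < best_val:
                best_val, best_u = val, u
    return best_val, best_u
```

Refining `w_grid` and `lam_grid` trades computation time for accuracy, in line with the discretization methods cited above.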
We define an ε-optimal policy of (2.6) as a policy π_ε ∈ Π that satisfies ‖v^{π_ε} − v⋆‖_ξ < ε for ε > 0, where v^π : X → R is the (worst-case) value function of a policy π ∈ Π, i.e.,

    v^π(x) := sup_{γ∈Γ} J_x(π, γ).    (3.3)

The following VI algorithm can be used to find an ε-optimal policy:

1. Initialize v_0 as an arbitrary function in B_lsc(X), and set k := 0;
2. For each x ∈ X, compute v_{k+1}(x) := (T v_k)(x) by solving the semi-infinite program (3.2) with v := v_k;
3. If the stopping criterion is met, then go to Step 4); otherwise, set k ← k + 1 and go to Step 2);
4. For each x ∈ X, set π̂(x) := û, where û is an optimal u of the semi-infinite program (3.2) that computes (T v_k)(x), and stop.

Note that the existence of an optimal û in Step 4) is guaranteed under Assumption 1 by [19, Lemma 3.3]. A typical stopping criterion in VI is ‖v_{k+1} − v_k‖_ξ < δ for some threshold δ > 0.
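The VI algorithm can be sketched as follows on a one-dimensional discretized state space with piecewise-linear interpolation of the value function; the argument `bellman` is assumed to behave like `bellman_dual` above with all model data frozen (e.g., via `functools.partial`).

```python
import numpy as np

def value_iteration(bellman, x_grid, delta=1e-4, max_iter=500):
    """Steps 1-4 of the VI algorithm on a finite state grid x_grid.
    bellman(x, v) must return ((T v)(x), minimizing u)."""
    v = np.zeros(len(x_grid))                         # arbitrary v_0
    for _ in range(max_iter):
        vk = lambda x, tbl=v: np.interp(x, x_grid, tbl)   # v_k off the grid
        v_next = np.array([bellman(x, vk)[0] for x in x_grid])
        if np.max(np.abs(v_next - v)) < delta:        # stopping criterion
            break
        v = v_next
    policy = [bellman(x, vk)[1] for x in x_grid]      # Step 4: greedy policy
    return v_next, policy
```

Here the unweighted sup-norm is used for the stopping test; with an unbounded weight ξ, the weighted norm of Lemma 2 should be used instead.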
However, we can even compute the number of iterations required to achieve a desired precision ε > 0.
Given any π ∈ Π_DS and v ∈ B_ξ(X), let

    (T_π v)(x) := sup_{µ∈D} [ c(x, π(x)) + α ∫_W v(f(x, π(x), w)) µ(dw) ]

for all x ∈ X. The Bellman operator T_π has the following properties:

Lemma 4.
Suppose that Assumption 1 holds. Then, given any π ∈ Π_DS, we have T_π v ∈ B_ξ(X) for any v ∈ B_ξ(X). Furthermore, the operator T_π : B_ξ(X) → B_ξ(X) is a τ-contraction mapping with respect to ‖·‖_ξ, i.e.,

    ‖T_π v − T_π v′‖_ξ ≤ τ‖v − v′‖_ξ  ∀v, v′ ∈ B_ξ(X),

where τ := αβ ∈ (0, 1). Furthermore, T_π is monotone, i.e., T_π v ≤ T_π v′ for all v, v′ ∈ B_ξ(X) such that v ≤ v′.

Proof.
By Assumption 1, it is clear that T_π v ∈ B_ξ(X) if v ∈ B_ξ(X). Fix arbitrary v, v′ ∈ B_ξ(X) and an arbitrary x ∈ X.
For any ε > 0, there exists µ̂ ∈ D such that

    (T_π v)(x) − ε < c(x, π(x)) + α ∫_W v(f(x, π(x), w)) µ̂(dw).

Thus, we have

    (T_π v)(x) − (T_π v′)(x) − ε < α ∫_W [ v(f(x, π(x), w)) − v′(f(x, π(x), w)) ] µ̂(dw)
                                 ≤ α ∫_W ‖v − v′‖_ξ ξ(f(x, π(x), w)) µ̂(dw)
                                 ≤ α ‖v − v′‖_ξ β ξ(x),

where the last inequality holds due to Assumption 1-1). By switching the roles of v and v′, we also have (T_π v′)(x) − (T_π v)(x) − ε ≤ αβ ‖v − v′‖_ξ ξ(x). Since the two inequalities hold for any x ∈ X and ε > 0,
and τ = αβ, we conclude that ‖T_π v − T_π v′‖_ξ ≤ τ‖v − v′‖_ξ. It is straightforward to check that T_π is monotone. ∎

This lemma implies that the value function v^π is the unique fixed point of T_π in B_ξ(X). By using the contraction property of T_π and T, we can estimate the number of iterations needed to obtain an ε-optimal policy as follows:

Proposition 2.
Suppose that Assumption 1 holds. We assume that given ε > 0, the total number of iterations, k, in the VI algorithm satisfies

    k > ( log[(1 − τ)² ε] − log(2bτ) ) / log τ,

where b ≥ 0 and τ ∈ (0, 1) are the constants defined in Assumption 1 and Lemma 4, respectively. Then, the policy π̂ obtained by the VI algorithm is an ε-optimal policy, i.e., ‖v^π̂ − v⋆‖_ξ < ε.

Proof.
By Lemma 4 and Theorem 1, we have v^π̂, v_k, v⋆ ∈ B_ξ(X). We observe that

    ‖v^π̂ − v⋆‖_ξ = ‖T_π̂ v^π̂ − v⋆‖_ξ ≤ ‖T_π̂ v^π̂ − T_π̂ v_k‖_ξ + ‖T_π̂ v_k − v⋆‖_ξ ≤ τ‖v^π̂ − v_k‖_ξ + ‖T v_k − T v⋆‖_ξ,

where the last inequality holds because of Lemma 4, T_π̂ v_k = T v_k and v⋆ = T v⋆. By Lemma 2, we have

    ‖v^π̂ − v⋆‖_ξ ≤ τ‖v^π̂ − v_k‖_ξ + τ‖v_k − v⋆‖_ξ ≤ τ‖v^π̂ − v⋆‖_ξ + 2τ‖v_k − v⋆‖_ξ.    (3.4)

On the other hand, by [19, Theorem 4.2 (a)],

    ‖v_k − v⋆‖_ξ ≤ (b/(1 − τ)) τ^k < ((1 − τ)/(2τ)) ε,    (3.5)

where the second inequality holds due to the proposed choice of k. Combining (3.4) and (3.5), we conclude that ‖v^π̂ − v⋆‖_ξ < ε. ∎

A practical implementation of the VI algorithm requires a finite-state approximation such as a discretization of the state space. A review of such approximation methods can be found in a recent monograph [31].

3.2 Policy Iteration

Policy iteration (PI) is an alternative way to construct an ε-optimal policy. The PI algorithm can be described as follows:

1. Initialize π_0 as an arbitrary policy in Π_DS, and set k := 0;
2. (Policy evaluation) Find the fixed point v^{π_k} of T_{π_k};
3. (Policy improvement) For each x ∈ X, set π_{k+1}(x) := ũ, where ũ is an optimal u of the semi-infinite program (3.2) that computes (T v^{π_k})(x);
4. If the stopping criterion is met, then stop and set π̃ := π_{k+1}; otherwise, set k ← k + 1 and go to Step 2).

Here, the stopping criterion can be chosen as ‖v^{π_k} − v^{π_{k−1}}‖_ξ < δ for a positive constant δ. To perform the policy evaluation step (Step 2) in a computationally tractable manner, we reformulate the infinite-dimensional maximization problem in the definition of T_π as a finite-dimensional problem by using Wasserstein DRO [3, 5].

Proposition 3.
Suppose that Assumption 1 holds and that v ∈ B_ξ(X). Then, the operator T_π : B_ξ(X) → B_ξ(X) satisfies

    (T_π v)(x) = sup_{(w,q)∈B} [ c(x, π(x)) + (α/N) Σ_{i=1}^N [ q̄_i v(f(x, π(x), w̄^(i))) + q̲_i v(f(x, π(x), w̲^(i))) ] ],

where B := { (w̄^(1), ..., w̄^(N), w̲^(1), ..., w̲^(N)) ∈ W^{2N}, q ∈ Δ | (1/N) Σ_{i=1}^N [ q̄_i d(w̄^(i), ŵ^(i))^p + q̲_i d(w̲^(i), ŵ^(i))^p ] ≤ θ^p } and Δ denotes the set of weights q = (q̄, q̲) with q̄_i, q̲_i ≥ 0 and q̄_i + q̲_i = 1 for each i.

This proposition follows immediately from [5, Corollary 2]. The optimization variables w̄^(1), ..., w̄^(N), w̲^(1), ..., w̲^(N) can be interpreted as the probability atoms that characterize one of the worst-case distributions. By the contraction property of T_{π_k} (Lemma 4), we can find the fixed point v^{π_k} of T_{π_k} by value iteration. In other words, we perform v_{τ+1} ← T_{π_k} v_τ, τ = 0, 1, ..., until convergence. When computing T_{π_k} v_τ, we solve the finite-dimensional optimization problem in Proposition 3 with v := v_τ to completely remove the infinite-dimensionality issue inherent in the definition of T_{π_k}. In the policy improvement step, we use the semi-infinite program formulation of T in Proposition 1 instead of directly solving the infinite-dimensional minimax optimization problem in the definition of T. It is well known that lim_{k→∞} ‖v^{π_k} − v⋆‖_ξ = 0 under Assumption 1 by the monotonicity and contraction properties of T and T_{π_k} (Lemmas 2 and 4) [32, Proposition 2.5.4].

However, it is usually difficult to find the exact fixed point v^{π_k} of T_{π_k} in the policy evaluation step. Thus, we propose a modified PI algorithm, which is also called optimistic policy iteration [33, 32]:

1. Initialize ṽ_0 as an arbitrary function in B_lsc(X) and {M_k} as a sequence of positive integers, and set k := 1;
2. (Policy improvement) For each x ∈ X, set π_k(x) := ũ, where ũ is an optimal u of the semi-infinite program (3.2) that computes (T ṽ_{k−1})(x);
3. (Policy evaluation) Compute ṽ_k := (T_{π_k})^{M_k} ṽ_{k−1} by solving the finite-dimensional optimization problems in Proposition 3;
4. If the stopping criterion is met, then stop and set π̃ := π_k; otherwise, set k ← k + 1 and go to Step 2).

Note that the modified PI algorithm approximately evaluates the performance of a policy π_k as ṽ_k instead of finding the exact fixed point of T_{π_k}. Concrete choices of the order sequence {M_k} are discussed in [34]. However, for any choice of {M_k}, the modified PI algorithm converges under Assumption 1 [32]:

    lim_{k→∞} ‖ṽ_k − v⋆‖_ξ = 0.
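As an illustration, the finite-dimensional program of Proposition 3 can be handed to an off-the-shelf nonlinear solver. The sketch below uses scipy's SLSQP, which is only a local method, so it should be read as a heuristic (a multistart strategy would be needed in general); scalar disturbances with d(w, w′) = |w − w′| are assumed, and the weights in Δ are encoded through a single splitting variable q_i ∈ [0, 1] per sample.

```python
import numpy as np
from scipy.optimize import minimize

def T_pi(x, v, f, c, pi, w_hat, alpha=0.95, theta=0.1, p=1):
    """Policy evaluation step: approximately solve the program of
    Proposition 3 for (T_pi v)(x). Decision vector z = (wbar, wunder, q)."""
    N, u = len(w_hat), pi(x)

    def neg_obj(z):
        wb, wu, q = z[:N], z[N:2*N], z[2*N:]
        vb = np.array([v(f(x, u, w)) for w in wb])
        vu = np.array([v(f(x, u, w)) for w in wu])
        return -(c(x, u) + alpha * np.mean(q * vb + (1 - q) * vu))

    def budget(z):   # Wasserstein budget; must be >= 0 for feasibility
        wb, wu, q = z[:N], z[N:2*N], z[2*N:]
        cost = np.mean(q * np.abs(wb - w_hat)**p + (1 - q) * np.abs(wu - w_hat)**p)
        return theta**p - cost

    z0 = np.concatenate([w_hat, w_hat, 0.5 * np.ones(N)])   # feasible start
    bounds = [(None, None)] * (2 * N) + [(0.0, 1.0)] * N
    res = minimize(neg_obj, z0, method="SLSQP", bounds=bounds,
                   constraints=[{"type": "ineq", "fun": budget}])
    return -res.fun, res.x
```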
As in the case of VI, we can estimate the number of iterations required for obtaining an ε-optimal policy.

Proposition 4.
Suppose that Assumption 1 holds. Let r ∈ R be a positive constant such that

    ‖ṽ_0 − T ṽ_0‖_ξ ≤ r.

We assume that given ε > 0, the total number of iterations, k, in the modified PI algorithm satisfies

    kτ^k < ((1 − τ)²/(2r)) ε,

where τ ∈ (0, 1) is the constant defined in Lemma 4. Then, π̃ := π_k obtained by the modified PI algorithm is an ε-optimal policy, i.e., ‖v^π̃ − v⋆‖_ξ < ε.

Proof.
According to Lemma 4 and Theorem 1, we have v^π̃, ṽ_k, v⋆ ∈ B_ξ(X). By [32, Lemma 2.5.4], we obtain

    ṽ_{k−1} − (kτ^{k−1}/(1 − τ)) r ξ ≤ v⋆ ≤ ṽ_{k−1} + (τ^{k−1}/(1 − τ)) r ξ,

which implies that

    ‖ṽ_{k−1} − v⋆‖_ξ ≤ (kτ^{k−1}/(1 − τ)) r.    (3.6)

On the other hand, π̃ = π_k is a greedy policy when the value function is chosen as ṽ_{k−1}. As in the proof of Proposition 2, we have ‖v^π̃ − v⋆‖_ξ ≤ (2τ/(1 − τ)) ‖ṽ_{k−1} − v⋆‖_ξ. Thus, by (3.6),

    ‖v^π̃ − v⋆‖_ξ ≤ (2kτ^k/(1 − τ)²) r < ε,

where the second inequality holds due to the proposed choice of k. ∎

3.3 Worst-Case Distribution Policy

Given a policy π ∈ Π_DS (for Player I), the worst-case distribution policy (for Player II) can be found by solving

    sup_{γ∈Γ} J_x(π, γ),

which is an optimal control problem. By the dynamic programming principle, the worst-case value function v^π, defined by (3.3), is the unique solution to the following Bellman equation:

    v^π = T_π v^π

under Assumption 1. The worst-case value function v^π can be computed, for example, via value iteration. Given v^π, how can we characterize the worst-case distribution policy? The following proposition indicates that, if the optimization problem involved in (T_π v^π)(x) admits an optimal solution for all x ∈ X, then there exists an optimal policy for Player II, which is deterministic and stationary, and it generates a finitely supported worst-case distribution.

Proposition 5 (Worst-case distribution policy). Suppose that Assumption 1 holds, and that given π ∈ Π_DS,

    sup_{µ∈D} [ c(x, π(x)) + α ∫_W v^π(f(x, π(x), w)) µ(dw) ]

admits an optimal solution for any x ∈ X. Then, the deterministic stationary policy γ^π : X → D defined by

    γ^π(x) := (1/N) Σ_{i=1}^N ( q̄^π_i δ_{w̄^{π,(i)}_x} + q̲^π_i δ_{w̲^{π,(i)}_x} )  ∀x ∈ X

is an optimal policy (for Player II) that generates a worst-case distribution for each state x ∈ X, where (w̄^{π,(1)}_x, ..., w̄^{π,(N)}_x, w̲^{π,(1)}_x, ..., w̲^{π,(N)}_x, q^π) is an optimal solution of the maximization problem in Proposition 3 with v := v^π.

The existence of an optimal policy, which is deterministic and stationary, follows from the dynamic programming principle when the assumptions in the proposition hold. Thus, it is sufficient for Player II to use the same worst-case distribution for all stages. The structure of γ^π(x) is obtained by applying [5, Corollary 1] to the maximization problem in the proposition. Note that the worst-case distribution of this form is consistent with the discussion below Proposition 3. By using [5, Corollary 2], we have the following sharper characterization of the worst-case distribution with N + 1 atoms: if the assumptions in Proposition 5 hold, one of the worst-case distribution policies has the form

    γ^π(x) := (1/N) Σ_{i≠i₀} δ_{w^{π,(i)}_x} + (p₀/N) δ_{w̄^{π,(i₀)}_x} + ((1 − p₀)/N) δ_{w̲^{π,(i₀)}_x},

where i₀ ∈ {1, ..., N}, p₀ ∈ [0, 1],

    w̄^{π,(i₀)}_x, w̲^{π,(i₀)}_x ∈ arg min_{w∈W} { λ⋆ d(w, ŵ^(i₀))^p − αv(f(x, π(x), w)) },

and w^{π,(i)}_x ∈ arg min_{w∈W} { λ⋆ d(w, ŵ^(i))^p − αv(f(x, π(x), w)) } for all i ≠ i₀. Here, λ⋆ is a dual minimizer, which must exist when the worst-case distribution exists [5, Corollary 1].

It is worth mentioning that Kantorovich duality and DP play a critical role in obtaining all the results in this section.
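Given a maximizer of the program in Proposition 3 (e.g., `res.x` from the `T_pi` sketch above), the worst-case distribution of Proposition 5 can be assembled from the atoms and splitting weights as follows. This small helper is an illustrative sketch tied to the variable layout of that earlier code, not part of the paper's algorithms.

```python
import numpy as np

def worst_case_sampler(z_opt, N, rng=np.random.default_rng()):
    """Assemble the worst-case distribution gamma_pi(x) of Proposition 5:
    atoms wbar_i and wunder_i carry masses q_i/N and (1-q_i)/N."""
    wb, wu, q = z_opt[:N], z_opt[N:2*N], z_opt[2*N:]
    atoms = np.concatenate([wb, wu])
    probs = np.concatenate([q, 1.0 - q]) / N   # sums to 1 by construction
    return lambda size=1: rng.choice(atoms, size=size, p=probs)
```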
Based on the reformulation results and analytical properties of DR-control problems, we demonstrate their utility in the following sections.

4 Out-of-Sample Performance Guarantee

A potential defect of the SAA-control formulation (2.2) is that its optimal policy may not perform well if a testing dataset of w_t differs from the training dataset {ŵ^(1), ..., ŵ^(N)}. This issue occurs even when the testing and training datasets are sampled from the same distribution. Such a degradation of the optimal decisions in out-of-sample tests is often called the optimizer's curse in the decision analysis literature [35]. We show that an optimal distributionally robust policy can alleviate this issue and provide a guaranteed out-of-sample performance if the radius θ of the Wasserstein ambiguity set is carefully determined.

Let π⋆_ŵ ∈ Π denote an optimal distributionally robust policy obtained by using the training dataset ŵ := {ŵ^(1), ..., ŵ^(N)} of N samples. The out-of-sample performance of π⋆_ŵ is measured as

    E^{π⋆_ŵ}_{w_t∼µ} [ Σ_{t=0}^∞ α^t c(x_t, u_t) | x_0 = x ],    (4.1)

which represents the expected total cost under a new sample that is generated (according to µ) independently of the training dataset. Unfortunately, the out-of-sample performance cannot be precisely computed because the true distribution µ is unknown. Thus, instead, we aim at establishing a probabilistic out-of-sample performance guarantee of the form

    µ^N { ŵ | E^{π⋆_ŵ}_{w_t∼µ} [ Σ_{t=0}^∞ α^t c(x_t, u_t) | x_0 = x ] ≤ v⋆_ŵ(x)  ∀x ∈ X } ≥ 1 − β,    (4.2)

where v⋆_ŵ denotes the optimal value function of the DR-control problem with the training dataset ŵ := {ŵ^(1), ..., ŵ^(N)}, and β ∈ (0, 1). The inequality represents a bound (1 − β) on the probability that the expected cost incurred by π⋆_ŵ is no greater than the optimal value function. Note that the probability and the expected cost are evaluated with respect to the true distribution µ. Thus, this inequality provides a probabilistic bound on the performance of π⋆_ŵ evaluated with unseen test samples drawn from µ. Here, v⋆_ŵ, which depends on θ, plays the role of a certificate for the out-of-sample performance.

Our goal is to identify conditions on the radius θ under which an optimal distributionally robust policy provides the probabilistic performance guarantee. We begin by imposing the following assumption on the true distribution µ:

Assumption 2 (Light tail). There exists a constant q > p such that

    ρ := ∫_W exp(‖w‖^q) dµ(w) < +∞.

This assumption implies that the tail of µ decays exponentially. Under this condition, the following measure concentration inequality holds:

Theorem 2 (Measure concentration, Theorem 2 in [36]). Suppose that Assumption 2 holds. Let

    ν_N := (1/N) Σ_{i=1}^N δ_{ŵ^(i)}.

Then,

    µ^N { ŵ | W_p(µ, ν_N) ≥ θ } ≤ c₁ [ b₁(N, θ) 1_{θ≤1} + b₂(N, θ) 1_{θ>1} ],

where

    b₁(N, θ) := exp(−c₂Nθ²)                   if p > l/2,
                exp(−c₂N(θ/log(2 + 1/θ))²)    if p = l/2,
                exp(−c₂Nθ^{l/p})              otherwise,

and b₂(N, θ) := exp(−c₂Nθ^{q/p}). Here, c₁, c₂ are positive constants depending only on l, q and ρ.

This theorem provides an upper bound on the probability that the true distribution µ lies outside of the Wasserstein ambiguity set D. The measure concentration inequality provides a systematic means to determine the radius for D to contain the true distribution µ with probability no less than 1 − β.
As shown in the following theorem, the contraction property of the Bellman operators enables us to extend the single-stage out-of-sample performance guarantee to its multi-stage counterpart with no additional requirement on θ.

Theorem 3 (Out-of-sample performance guarantee). Suppose that Assumptions 1 and 2 hold. Let π⋆_ŵ and v⋆_ŵ denote an optimal policy and the optimal value function of the DR-control problem (2.6)² with the training dataset ŵ := {ŵ^(1), ..., ŵ^(N)} and the following Wasserstein ball radius:

    θ(N, β) := [ log(c₁β⁻¹)/(c₂N) ]^{p/q}  if N < log(c₁β⁻¹)/c₂,
               [ log(c₁β⁻¹)/(c₂N) ]^{1/2}  if N ≥ log(c₁β⁻¹)/c₂ and p > l/2,
               [ log(c₁β⁻¹)/(c₂N) ]^{p/l}  if N ≥ log(c₁β⁻¹)/c₂ and p < l/2,
               θ̄                            if N ≥ (log 3)² log(c₁β⁻¹)/c₂ and p = l/2,

where θ̄ satisfies θ̄/log(2 + 1/θ̄) = [ log(c₁β⁻¹)/(c₂N) ]^{1/2}, and c₁, c₂ are the positive constants in Theorem 2. Then, the probabilistic out-of-sample performance guarantee (4.2) holds.

²Here, ŵ, π⋆_ŵ and v⋆_ŵ are viewed as random objects.

Proof.
Using Theorem 2, we can confirm that our choice of θ provides the following probabilistic guarantee:

    µ^N { ŵ | W_p(µ, ν_N) ≤ θ(N, β) } ≥ 1 − β.    (4.3)

Define an operator T⋆ : B_ξ(X) → B_ξ(X) as (T⋆v)(x) := E^µ[ c(x, π⋆_ŵ(x)) + αv(f(x, π⋆_ŵ(x), w)) ] for all x ∈ X. It follows from (4.3) that the following single-stage guarantee holds:

    µ^N { ŵ | (T⋆ v⋆_ŵ)(x) ≤ (T v⋆_ŵ)(x) } ≥ 1 − β    (4.4)

for any fixed x ∈ X. It is straightforward to check under Assumption 1 that T⋆ is a monotone contraction mapping.

We now show that if µ ∈ D, then (T⋆)^k v⋆_ŵ ≤ v⋆_ŵ for any k = 1, 2, ..., using mathematical induction. For k = 1, we have T⋆ v⋆_ŵ ≤ T v⋆_ŵ = v⋆_ŵ by the minimax definition of T. Suppose now that the induction hypothesis holds for some k. By the monotonicity of T⋆ and the definition of T, we have T⋆((T⋆)^k v⋆_ŵ) ≤ T⋆ v⋆_ŵ ≤ T v⋆_ŵ = v⋆_ŵ, and thus the induction hypothesis is valid for k + 1. We now notice that

    lim_{k→∞} ((T⋆)^k v⋆_ŵ)(x) = E_{w_t∼µ} [ Σ_{t=0}^∞ α^t c(x_t, π⋆_ŵ(x_t)) | x_0 = x ],

since T⋆ is a contraction mapping under Assumption 1. Therefore, if µ ∈ D, then

    E_{w_t∼µ} [ Σ_{t=0}^∞ α^t c(x_t, π⋆_ŵ(x_t)) | x_0 = x ] ≤ v⋆_ŵ(x)  ∀x ∈ X.

By (4.3), the probabilistic performance guarantee holds as desired. ∎
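For reference, the radius prescribed by Theorem 3 is straightforward to compute once the constants c₁, c₂ of Theorem 2 (or bootstrap estimates of them) are available. A minimal Python sketch, omitting the boundary case p = l/2 (whose θ̄ requires a one-dimensional root-finding step):

```python
import numpy as np

def radius(N, beta, c1, c2, p, q, l):
    """Wasserstein radius theta(N, beta) from Theorem 3 for p != l/2."""
    base = np.log(c1 / beta) / (c2 * N)   # log(c1/beta) / (c2 N)
    if N < np.log(c1 / beta) / c2:
        return base ** (p / q)            # small-sample (theta > 1) regime
    return base ** 0.5 if p > l / 2 else base ** (p / l)
```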
Remark 1.
Note that the contraction property of T and T⋆ plays a critical role in connecting the single-stage performance guarantee (4.4) to the multi-stage guarantee (4.2) in a way that is independent of the number of stages. This is a quite powerful result: if we have a radius θ that provides a desirable confidence level 1 − β in the single-stage guarantee, we can use the same radius to achieve the same level of confidence in the multi-stage guarantee with no additional requirement. This choice includes the radius proposed in [3] in the single-stage setting as a special case (when p = 1 and l ≠ 2).³

³The constants c₁ and c₂ in Theorem 2 can be calculated using the proof of Theorem 2 in [36]. However, this calculation is often conservative and thus results in a smaller radius θ(N, β) than necessary. Bootstrapping and cross-validation methods can be used to reduce the conservativeness in the a priori bound θ(N, β), as advocated and demonstrated in [3].

5 Wasserstein Penalty Problem

We now consider a slightly different version of the DR-control problem, which can be considered as a relaxation of (2.6) with a fixed penalty parameter λ > 0:

    inf_{π∈Π} sup_{γ∈Γ′} E^{π,γ} [ Σ_{t=0}^∞ α^t ( c(x_t, u_t) − λ W_p(µ_t, ν_N)^p ) | x_0 = x ],

where the strategy space Γ′ := {γ := (γ_0, γ_1, ...) | γ_t(P(W) | h^e_t) = 1 ∀h^e_t ∈ H^e_t} of Player II no longer depends on a Wasserstein ambiguity set. Instead of using an explicit ambiguity set D, Player II is penalized by λ W_p(µ_t, ν_N)^p, which can be interpreted as the cost of perturbing the empirical distribution ν_N.

Under Assumption 1, the Bellman operator T′_λ : B_ξ(X) → B_ξ(X) of the Wasserstein penalty problem is defined by

    (T′_λ v)(x) := inf_{u∈U(x)} sup_{µ∈P(W)} E^µ [ c(x, u) − λ W_p(µ, ν_N)^p + αv(f(x, u, w)) ]

for all x ∈ X. By using the strong duality result [5, Theorem 1], we have the following equivalent form of T′_λ:

Proposition 6.
Suppose that the function w ↦ v(f(x, u, w)) lies in L¹(dν_N) for each (x, u) ∈ K. Then, the Bellman operator T′_λ can be expressed as

    (T′_λ v)(x) = inf_{u∈U(x)} [ c(x, u) + (1/N) Σ_{i=1}^N sup_{w′∈W} [ αv(f(x, u, w′)) − λ d(ŵ^(i), w′)^p ] ]

for all x ∈ X. Furthermore, we have

    (Tv)(x) = inf_{λ≥0} [ (T′_λ v)(x) + λθ^p ]  ∀x ∈ X.

By the results of [19] in the general minimax control setting, the optimal value function v′ is the unique fixed point (in B_lsc(X)) of T′_λ under Assumption 1, because T′_λ is a contraction. We can use value iteration to evaluate v′ owing to the Banach fixed-point theorem. Analogously to Theorem 1, there exists a deterministic stationary policy π′, which is optimal, where

    π′(x) ∈ arg min_{u∈U(x)} [ c(x, u) + (1/N) Σ_{i=1}^N sup_{w′∈W} [ αv′(f(x, u, w′)) − λ d(ŵ^(i), w′)^p ] ]

for all x ∈ X, under Assumption 1.

5.1 Linear-Quadratic Problems

We now develop a solution approach, using a Riccati-type equation, for linear-quadratic (LQ) problems with the Wasserstein penalty when d(w, w′)^p := ‖w − w′‖², where ‖·‖ denotes the Euclidean norm on R^l. Consider a linear system of the form

    x_{t+1} = Ax_t + Bu_t + Ξw_t,    (5.1)

where A ∈ R^{n×n}, B ∈ R^{n×m}, and Ξ ∈ R^{n×l}. We also choose the following quadratic stage-wise cost function:

    c(x_t, u_t) = x_t^⊤ Q x_t + u_t^⊤ R u_t,    (5.2)

where Q = Q^⊤ ∈ R^{n×n} is positive semidefinite, and R = R^⊤ ∈ R^{m×m} is positive definite. For the sake of simplicity, we assume that E_{w∼ν_N}[w] = (1/N) Σ_{i=1}^N ŵ^(i) = 0; the case of nonzero mean is considered in Appendix B. Let Σ := E_{w∼ν_N}[ww^⊤] = (1/N) Σ_{i=1}^N ŵ^(i)(ŵ^(i))^⊤. In the LQ setting, we also set X := R^n, U(x) ≡ U := R^m, and W := R^l. Note that, unlike standard LQG, the LQ problems with Wasserstein penalty do not assume that the probability distribution of the random disturbances is Gaussian. In fact, the main motivation for this distributionally robust LQ formulation is to relax the assumption of Gaussian disturbance distributions in LQG and to obtain a useful control policy when the true distribution deviates from a Gaussian distribution.

By using DP, we obtain the following explicit solution of the LQ problem:

Theorem 4.
Suppose that there exists a symmetric positive semidefinite matrix P ∈ R^{n×n} that solves the following equation:

    P = Q + αA^⊤PA + α²A^⊤SA    (5.3)

with

    S := PΞ(λI − αΞ^⊤PΞ)⁻¹Ξ^⊤P
         − [I + αΞ(λI − αΞ^⊤PΞ)⁻¹Ξ^⊤P]^⊤ PB [R + αB^⊤{P + αPΞ(λI − αΞ^⊤PΞ)⁻¹Ξ^⊤P}B]⁻¹ B^⊤P [I + αΞ(λI − αΞ^⊤PΞ)⁻¹Ξ^⊤P]

for a sufficiently large λ. Then, v′(x) := x^⊤Px + z solves the Bellman equation, where

    z := (λ/(1 − α)) tr[ {λ(λI − αΞ^⊤PΞ)⁻¹ − I} Σ ].

If, in addition, v′ is the optimal value function,⁴ then the unique optimal policy π′ is given by

    π′(x) = Kx  ∀x ∈ R^n,

where

    K := −[R + αB^⊤{P + αPΞ(λI − αΞ^⊤PΞ)⁻¹Ξ^⊤P}B]⁻¹ αB^⊤P [I + αΞ(λI − αΞ^⊤PΞ)⁻¹Ξ^⊤P] A.

Furthermore, if we let

    w′^(i)_x := (λI − αΞ^⊤PΞ)⁻¹ [ αΞ^⊤P(A + BK)x + λŵ^(i) ],

then the deterministic stationary policy γ′ ∈ Γ′, defined as

    γ′(x) = (1/N) Σ_{i=1}^N δ_{w′^(i)_x}  ∀x ∈ R^n,

is an optimal policy for Player II that generates a worst-case distribution for each x ∈ R^n.

⁴Sufficient conditions for v′ to be the optimal value function are provided in [37]. Under the stabilizability and observability conditions, the algebraic Riccati equation has a unique positive semidefinite solution as well.

The optimal control policy π′ is linear in the system state. Furthermore, the control gain matrix K is independent of the covariance matrix Σ, as in standard LQG. The worst-case distribution's support elements w′^(i)_x are affine in the system state. More specifically, w′^(i)_x is obtained by scaling the ith data sample ŵ^(i) ∈ R^l by the factor λ(λI − αΞ^⊤PΞ)⁻¹ and shifting it by the vector α(λI − αΞ^⊤PΞ)⁻¹Ξ^⊤P(A + BK)x, which is linear in the system state. Distributional robustness is controlled by the penalty parameter λ: as λ increases, the permissible deviation of µ_t from ν_N decreases. This is equivalent to decreasing the Wasserstein ball radius θ in the original DR-control setting. Thus, by letting λ tend to +∞, the optimal distributionally robust policy for the LQ problem converges pointwise to the standard LQ optimal control policy.
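The Riccati-type equation (5.3) can be solved numerically by a plain fixed-point iteration. The following numpy sketch assumes, rather than proves, that the iteration converges and that λ is large enough for λI − αΞ^⊤PΞ to remain positive definite along the way; it returns P, the gain K, the constant z, and the worst-case atom map of Theorem 4.

```python
import numpy as np

def dr_lq(A, B, Xi, Q, R, Sigma, alpha, lam, iters=1000, tol=1e-10):
    """Fixed-point iteration for (5.3) and the quantities of Theorem 4."""
    n, l = A.shape[0], Xi.shape[1]
    P = np.zeros((n, n))
    for _ in range(iters):
        M = np.linalg.inv(lam * np.eye(l) - alpha * Xi.T @ P @ Xi)
        Pt = P + alpha * P @ Xi @ M @ Xi.T @ P   # disturbance-adjusted P
        G = np.linalg.inv(R + alpha * B.T @ Pt @ B)
        # algebraically equivalent to P = Q + alpha A'PA + alpha^2 A'SA
        P_new = Q + alpha * A.T @ Pt @ A \
                - alpha**2 * A.T @ Pt @ B @ G @ B.T @ Pt @ A
        if np.max(np.abs(P_new - P)) < tol:
            P = P_new
            break
        P = P_new
    M = np.linalg.inv(lam * np.eye(l) - alpha * Xi.T @ P @ Xi)
    Pt = P + alpha * P @ Xi @ M @ Xi.T @ P
    K = -alpha * np.linalg.inv(R + alpha * B.T @ Pt @ B) @ B.T @ Pt @ A
    z = lam / (1 - alpha) * np.trace((lam * M - np.eye(l)) @ Sigma)
    # worst-case atom w'^(i)_x from Theorem 4
    worst_atom = lambda x, w_hat_i: M @ (alpha * Xi.T @ P @ (A + B @ K) @ x
                                         + lam * w_hat_i)
    return P, K, z, worst_atom
```

The code iterates on the compact form P = Q + αA^⊤P̃A − α²A^⊤P̃B(R + αB^⊤P̃B)⁻¹B^⊤P̃A with P̃ := P + αPΞ(λI − αΞ^⊤PΞ)⁻¹Ξ^⊤P, which one can check is the same fixed point as (5.3).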
Suppose that (A, B) is stabilizable and (A, C) is observable, where Q = C^⊤C. Let P̄ be the unique symmetric positive definite solution of the following discrete algebraic Riccati equation:

    P̄ = Q + αA^⊤P̄A − α²A^⊤P̄B(R + αB^⊤P̄B)⁻¹B^⊤P̄A,    (5.4)

and let

    K̄ := −α(R + αB^⊤P̄B)⁻¹B^⊤P̄A.
Then, for each x ∈ X,

    π′(x) → K̄x,  w′^(i)_x → ŵ^(i)    (5.5)

as λ → ∞, where π′ and w′^(i)_x are defined in Theorem 4.

Proof. Let P_λ denote a symmetric positive semidefinite solution of (5.3) for any fixed λ ≥ λ̄. As λ tends to +∞, the right-hand side of (5.3) tends to Q + αA^⊤P_λA − α²A^⊤P_λB(R + αB^⊤P_λB)⁻¹B^⊤P_λA, which corresponds to the right-hand side of (5.4) with P̄ = P_λ. Therefore, P_λ solves the algebraic Riccati equation (5.4) as λ → ∞. On the other hand, (5.4) admits a unique positive definite solution when (A, C) is observable and
(A, B) is stabilizable (e.g., [38, Section 2.4]). Thus, P_λ converges to P̄ as λ → ∞. Likewise, we can show that the feedback gain matrix K and the worst-case distribution's support element w′^(i)_x (defined in Theorem 4) tend to K̄ and ŵ^(i), respectively, as λ → ∞. Therefore, the result follows. ∎

6 Numerical Experiments

6.1 Investment-Consumption Problem

We first demonstrate the performance and utility of DR-control through an investment-consumption problem (e.g., [39, 40]). Let x_t be the wealth of an investor at stage t. The investor wishes to decide the amount u_{1,t} to be invested in a risky asset (with an i.i.d. random rate of return, w_t) and the amount u_{2,t} to be consumed at stage t. The remaining amount (x_t − u_{1,t} − u_{2,t}) is automatically re-invested in a riskless asset with a deterministic rate of return, η. Then, the investor's wealth evolves as

    x_{t+1} = η(x_t − u_{1,t} − u_{2,t}) + w_t u_{1,t}.

We assume that the control actions u_{1,t} and u_{2,t} satisfy the following constraints:

    u_{1,t} + u_{2,t} ≤ x_t,  u_{1,t}, u_{2,t} ≥ 0  ∀t,
i.e., U(x) := {u := (u_1, u_2) ∈ R² | u_1 + u_2 ≤ x, u ≥ 0}. The cost function is given by the following negative expected utility of consumption:

    J(π, γ) := −E^{π,γ} [ Σ_{t=0}^∞ α^t U(u_{2,t}) ],

where the utility function U : R → R is selected as U(c) = c − ζc². The numerical simulations use fixed values of the utility parameter ζ and discount factor α, the riskless return η = 1.02, and the Wasserstein order p = 1.

Figure 1: Depending on the radius θ and the number of samples N: (a) the reliability µ^N{ŵ | E^{π⋆_ŵ}_{w_t∼µ}[Σ_{t=0}^∞ α^t c(x_t, u_t) | x_0 = x] ≤ v⋆_ŵ(x)}; (b) the out-of-sample performance (cost) of π⋆_ŵ.
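For concreteness, the ingredients of this example translate directly into code. In the sketch below, η = 1.02 is taken from the text, while the utility parameter ζ is an illustrative placeholder rather than the value used in the paper's experiments.

```python
# Investment-consumption model of Section 6.1.
eta, zeta = 1.02, 0.05   # zeta is a placeholder value

def f(x, u, w):
    """Wealth dynamics: u = (u1, u2) = (risky investment, consumption)."""
    u1, u2 = u
    return eta * (x - u1 - u2) + w * u1

def stage_cost(x, u):
    """Negative utility of consumption, U(c) = c - zeta*c^2."""
    u2 = u[1]
    return -(u2 - zeta * u2**2)

def admissible(x, u):
    """U(x): nonnegative actions whose sum does not exceed current wealth."""
    u1, u2 = u
    return u1 >= 0 and u2 >= 0 and u1 + u2 <= x
```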
The data samples {ŵ^(1), ..., ŵ^(N)} of w_t are generated according to a normal distribution. We numerically approximate the optimal value function v⋆_ŵ and the corresponding optimal policy π⋆_ŵ on a computational grid by using the convex optimization approach in [41]. This method approximates the Bellman operator by the optimal value of a convex program with a uniform convergence property. Furthermore, it does not require any explicit interpolation in evaluating the value function and control policies at states other than the grid points, as it uses an auxiliary optimization variable to assign the contribution of each grid point to the next state.

The numerical experiments were conducted on a Mac with a 4.2 GHz Intel Core i7 and 64 GB RAM. The amount of time required for simulations with different grid sizes and N = 10 is reported in Table 1. For the rest of the simulations, we used 71 states (with grid spacing 0.02).

To demonstrate the out-of-sample performance guarantee of an optimal distributionally robust policy, we compute the following reliability of π⋆_ŵ:

    µ^N { ŵ | E^{π⋆_ŵ}_{w_t∼µ} [ Σ_{t=0}^∞ α^t c(x_t, u_t) | x_0 = x ] ≤ v⋆_ŵ(x) },
which represents the probability that the expected cost incurred by π⋆_ŵ under the true distribution µ is no greater than v⋆_ŵ(x). As shown in Fig. 1(a), the reliability increases with the Wasserstein ball radius θ and the number N of samples. This result is consistent with Theorem 3. Our numerical experiments also confirm that the same radius θ can be used to achieve the same level of reliability in both single-stage and multi-stage settings, as indicated in the theorem.

Fig. 1(b) illustrates the out-of-sample cost (4.1) of π⋆_ŵ with respect to θ and N. Interestingly, the out-of-sample cost does not decrease monotonically with θ. For a too-small radius, the resulting DR-policy is not sufficiently robust to obtain the best out-of-sample performance (i.e., the least out-of-sample cost). On the other hand, if a too-large Wasserstein ambiguity set is selected, the resulting DR-policy is overly conservative and thus sacrifices closed-loop performance. Thus, there exists an optimal radius (e.g., 0.02 in the case of N = 20) that provides the best out-of-sample performance.⁵

⁵This observation is consistent with the single-stage case in Section 7.2 of [3].

Figure 2: The out-of-sample performance (cost) of the optimal SAA policy π^SAA_ŵ (◦) and the optimal distributionally robust policy π⋆_ŵ (⋄) depending on N.
To compare DR-control (2.6) with SAA-control (2.2), we first compute the out-of-sample performance of π⋆_ŵ and that of the corresponding optimal SAA policy π^SAA_ŵ obtained by using the same training dataset ŵ. The radius is selected as the one that provides the best out-of-sample performance. As shown in Fig. 2, the proposed DR-policy achieves an 8% lower out-of-sample cost than the SAA-policy when N = 10. As expected, the gap between the two decreases with the number of samples. Note that the proposed DR-policy designed even with a small number of samples (N = 10) maintains its performance under a test dataset that is generated independently of the training dataset, unlike the corresponding SAA-policy.

6.2 Power System Frequency Control

Consider an electric power transmission system with N buses (n̄ of which are generator buses). This system may be subject to ambiguous uncertainty generated from variable renewable energy sources such as wind and solar. For the frequency regulation of this system, we use the proposed Wasserstein penalty method to control the mechanical power input of each generator. Let θ_i and P_{e,i} be the voltage phase angle and the electrical active power injection at bus i.
The matrix $R$ is chosen to be the $\bar n \times \bar n$ identity matrix.

The IEEE 39-bus New England test case (with 10 generator buses, 29 load buses, and 40 transmission lines) is used to demonstrate the performance of the proposed LQ control $\pi'_{\hat w}$ with Wasserstein penalty. The initial values of the voltage angles $\theta(0)$ are determined by solving the (steady-state) power flow problem using MATPOWER [46]. The initial frequency is set to be zero for all buses except bus 1, at which $\dot\theta(0) := 0.\cdot$, and we set $\alpha = 0.\cdot$.

The Kron reduction is used to express the system in the reduced dimension $\bar n$ by focusing on the interactions of the generator buses [42]. More precisely, we can obtain the Kron-reduced admittance matrix $Y^{\mathrm{Kron}}$ by eliminating a non-generator bus $k$ via $Y^{\mathrm{Kron}}_{ij} := Y_{ij} - Y_{ik} Y_{kj} / Y_{kk}$ for all $i, j = 1, \dots, N$ such that $i, j \neq k$. The Kron-reduced Laplacian can then be obtained by setting $L_{ii} := \sum_{k = 1, \dots, \bar n :\, k \neq i} B^{\mathrm{Kron}}_{ik}$ and $L_{ij} := -B^{\mathrm{Kron}}_{ij}$ for $i \neq j$, where $B^{\mathrm{Kron}}$ denotes the susceptance of the Kron-reduced admittance matrix [43].
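The Kron reduction described above is straightforward to implement by eliminating one non-generator bus at a time. The following sketch operates on a small placeholder admittance matrix; the actual 39-bus data would come from the test case itself.

```python
import numpy as np

def kron_reduce(Y, keep):
    """Eliminate every bus not in `keep` from the admittance matrix Y, one
    bus at a time: Y_ij <- Y_ij - Y_ik Y_kj / Y_kk (highest index first, so
    the remaining original indices stay valid)."""
    Y = np.asarray(Y, dtype=complex).copy()
    for k in sorted(set(range(Y.shape[0])) - set(keep), reverse=True):
        col = np.delete(Y[:, k], k)
        row = np.delete(Y[k, :], k)
        Y = np.delete(np.delete(Y, k, 0), k, 1) - np.outer(col, row) / Y[k, k]
    return Y

def laplacian_from(Y_kron):
    """L_ij = -B_ij (i != j) and L_ii = sum_{k != i} B_ik, with B = Im(Y_kron)."""
    L = -Y_kron.imag
    np.fill_diagonal(L, 0.0)
    np.fill_diagonal(L, -L.sum(axis=1))
    return L

# Placeholder 4-bus network; keep generator buses {0, 1}.
Y = np.array([[ 2 - 4j, -1 + 2j, -1 + 2j,  0 + 0j],
              [-1 + 2j,  3 - 6j, -1 + 2j, -1 + 2j],
              [-1 + 2j, -1 + 2j,  3 - 6j, -1 + 2j],
              [ 0 + 0j, -1 + 2j, -1 + 2j,  2 - 4j]])
L = laplacian_from(kron_reduce(Y, keep=[0, 1]))
```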
Figure 3: The box plot of the frequency deviation $\dot\theta$ controlled by (a) the standard LQG control policy $\pi^{\mathrm{LQG}}_{\hat w}$ and (b) the optimal DR-control policy $\pi'_{\hat w}$ with Wasserstein penalty, under the worst-case distribution policy. (Both panels plot frequency (per unit) against time (s).)

Table 2: The amount of time (in seconds) required to decrease and maintain the mean frequency deviation below 1%.

Bus                          | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
$\pi^{\mathrm{LQG}}_{\hat w}$ | – | – | – | – | – | – | – | – | – | –
$\pi'_{\hat w}$               | – | – | – | – | – | – | – | – | – | –

We first compare the standard LQG control policy $\pi^{\mathrm{LQG}}_{\hat w}$ and the proposed DR-control policy $\pi'_{\hat w}$ with the Wasserstein penalty under the worst-case distribution policy $\gamma'_{\hat w}$ obtained by using the proof of Theorem 4. We set $N = 10$ and $\lambda = 0.03$. The i.i.d. samples $\{\hat w^{(i)}\}_{i=1}^N$ are generated according to the normal distribution $\mathcal N(0, \cdot I)$. As depicted in Fig. 3, $\pi'_{\hat w}$ is less sensitive than $\pi^{\mathrm{LQG}}_{\hat w}$ to the worst-case distribution policy.
In the $[0, 24]$ second interval, the frequency controlled by $\pi^{\mathrm{LQG}}_{\hat w}$ fluctuates around nonzero values, while $\pi'_{\hat w}$ keeps the frequency fluctuation centered approximately around zero. This is because the proposed DR-method takes into account the possibility of nonzero-mean disturbances, whereas the standard LQG method assumes zero-mean disturbances. Furthermore, the proposed DR-method suppresses the frequency fluctuation much faster than the standard LQG method: under $\pi'_{\hat w}$, the mean frequency deviation averaged across the buses stays below 1% at all times after 16.7 seconds, whereas the standard LQG control takes 41.8 seconds to bring the mean frequency deviation (averaged across the buses) below 1%. The detailed results for each bus are reported in Table 2.

(In Fig. 3, the central bar on each box indicates the median; the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively; and the '+' symbol represents the outliers. The frequency deviation at the other buses displays a similar behavior.)

Figure 4: The reliability $\mu^N\{\hat w \mid \mathbb E^{\pi'_{\hat w}}_{w_t \sim \mu}[\sum_{t=0}^\infty \alpha^t c(x_t, u_t) \mid x_0 = x] \le v'_{\hat w}(x)\}$ in the Wasserstein penalty case, depending on $\lambda$ and $N$ ($N = 10, 20, 40$).

We now examine the out-of-sample performance of $\pi'_{\hat w}$ and how it depends on the penalty parameter $\lambda$ and the number $N$ of samples. The i.i.d. samples $\{\hat w^{(i)}\}_{i=1}^N$ are generated according to the normal distribution $\mathcal N(0, I)$. Given $\lambda$ and $N$, we define the reliability of $\pi'_{\hat w}$ as
$$\mu^N \Big\{ \hat w \;\Big|\; \mathbb E^{\pi'_{\hat w}}_{w_t \sim \mu} \Big[ \sum_{t=0}^\infty \alpha^t c(x_t, u_t) \,\Big|\, x_0 = x \Big] \le v'_{\hat w}(x) \Big\}.$$
As shown in Fig. 4, the reliability decreases with $\lambda$: with a larger $\lambda$, the control policy $\pi'_{\hat w}$ becomes less robust against the deviation of the empirical distribution from the true distribution, since increasing $\lambda$ has the effect of decreasing the radius $\theta$ in DR-control. In addition, the reliability tends to increase with the number $N$ of samples used to design $\pi'_{\hat w}$. This result is consistent with the dependence of the DR-control reliability on the number of samples. Using this result, we can determine the penalty parameter that attains a desired out-of-sample performance guarantee (or reliability), given the number of samples.
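In principle, this reliability can be estimated empirically by resampling training datasets: draw many datasets from $\mu^N$, run the design procedure on each, and count how often the resulting certificate upper-bounds the true expected cost. The sketch below is schematic; `design` and `true_cost` are hypothetical stand-ins for the Wasserstein-penalty synthesis and for a rollout-based cost estimator such as the one sketched earlier, not functions provided by the paper.

```python
import numpy as np

def estimate_reliability(design, true_cost, N, n_datasets=200, seed=0):
    """Fraction of training datasets w_hat ~ mu^N whose certificate v_hat
    upper-bounds the true expected cost of the designed policy."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_datasets):
        w_hat = rng.standard_normal((N, 1))  # dataset drawn from mu = N(0, I)
        policy, v_hat = design(w_hat)        # hypothetical DR design + certificate
        hits += float(true_cost(policy) <= v_hat)
    return hits / n_datasets

# Sweeping the penalty parameter would trace out curves like those in Fig. 4:
#   for lam in (0.01, 0.03, 0.1):
#       print(lam, estimate_reliability(lambda d: dr_design(d, lam),
#                                       true_cost, N=20))
```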
Conclusions

In this paper, we considered distributionally robust stochastic control problems with Wasserstein ambiguity sets that directly use the data samples of uncertain variables. We showed that the proposed framework has several salient features, including (i) computational tractability with error bounds, (ii) an out-of-sample performance guarantee, and (iii) an explicit solution in the LQ setting. It is worth emphasizing that the Kantorovich duality principle plays a critical role in our DP solution and analysis. Furthermore, with regard to the out-of-sample performance guarantee, our analysis provides the unique insight that the contraction property of the Bellman operators extends a single-stage guarantee, obtained using a measure concentration inequality, to the corresponding multi-stage guarantee without any degradation in the confidence level.

A Proof of Lemma 1

Proof.
Recall that, by the Kantorovich duality principle, the Wasserstein distance between $\mu$ and $\nu_N$ can be written as
$$W(\mu, \nu_N) = \sup_{(\varphi, \psi) \in \Phi} \left\{ \int_{\mathcal W} \varphi(w) \,\mathrm d\mu(w) + \int_{\mathcal W} \psi(w') \,\mathrm d\nu_N(w') \right\},$$
where $\Phi := \{ (\varphi, \psi) \in L^1(\mathrm d\mu) \times L^1(\mathrm d\nu_N) \mid \varphi(w) + \psi(w') \le d(w, w')^p \ \forall w, w' \in \mathcal W \}$. Let
$$\hat{\mathcal D} := \left\{ \mu \in \mathcal P(\mathcal W) \;\middle|\; \int_{\mathcal W} \varphi(w) \,\mathrm d\mu(w) + \int_{\mathcal W} \inf_{w \in \mathcal W} \big[ d(w, w')^p - \varphi(w) \big] \,\mathrm d\nu_N(w') \le \theta^p \ \ \forall \varphi \in L^1(\mathrm d\mu) \right\}.$$
We claim that $\hat{\mathcal D} = \mathcal D$. Choose an arbitrary $\mu$ from $\hat{\mathcal D}$. Note that for any $(\varphi, \psi) \in \Phi$,
$$\psi(w') \le \inf_{w \in \mathcal W} \big[ d(w, w')^p - \varphi(w) \big] \quad \forall w' \in \mathcal W.$$
Thus, we have
$$W(\mu, \nu_N) \le \sup_{\varphi \in L^1(\mathrm d\mu)} \left\{ \int_{\mathcal W} \varphi(w) \,\mathrm d\mu(w) + \int_{\mathcal W} \inf_{w \in \mathcal W} \big[ d(w, w')^p - \varphi(w) \big] \,\mathrm d\nu_N(w') \right\} \le \theta^p,$$
where the last inequality holds because $\mu \in \hat{\mathcal D}$. Therefore, $\mu \in \mathcal D$, which implies that $\hat{\mathcal D} \subseteq \mathcal D$.

We now select an arbitrary $\mu$ from $\mathcal D$. Fix $\varphi \in L^1(\mathrm d\mu)$ and define a function $\hat\psi : \mathcal W \to \mathbb R$ by
$$\hat\psi(w') := \inf_{w \in \mathcal W} \big[ d(w, w')^p - \varphi(w) \big] \quad \forall w' \in \mathcal W.$$
Then, $\hat\psi \in L^1(\mathrm d\nu_N)$ and $(\varphi, \hat\psi) \in \Phi$. Thus,
$$\int_{\mathcal W} \varphi(w) \,\mathrm d\mu(w) + \int_{\mathcal W} \hat\psi(w') \,\mathrm d\nu_N(w') \le W(\mu, \nu_N) \le \theta^p,$$
which holds for any $\varphi \in L^1(\mathrm d\mu)$. By the definition of $\hat\psi$, this implies that $\mu \in \hat{\mathcal D}$. Therefore, $\mathcal D \subseteq \hat{\mathcal D}$.
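For discrete measures, the duality invoked in this proof can be verified numerically: the optimal-transport problem between two empirical measures is a finite linear program, and its LP dual is exactly the Kantorovich potential problem. The following check with scipy.optimize.linprog is an illustration of the principle on random data, not a computation taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, p = 4, 5, 2
x, y = rng.normal(size=(m, 1)), rng.normal(size=(n, 1))
mu, nu = np.full(m, 1 / m), np.full(n, 1 / n)      # empirical weights
C = np.abs(x - y.T) ** p                           # transport cost d(w, w')^p

# Primal: min <C, gamma> over couplings gamma with marginals mu and nu.
A_eq = np.vstack([np.kron(np.eye(m), np.ones(n)),  # row sums equal mu
                  np.kron(np.ones(m), np.eye(n))]) # column sums equal nu
primal = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]))

# Dual: max mu . phi + nu . psi subject to phi_i + psi_j <= C_ij.
A_ub = np.hstack([np.kron(np.eye(m), np.ones((n, 1))),
                  np.kron(np.ones((m, 1)), np.eye(n))])
dual = linprog(-np.concatenate([mu, nu]), A_ub=A_ub, b_ub=C.ravel(),
               bounds=[(None, None)] * (m + n))

print(primal.fun, -dual.fun)  # equal up to solver tolerance
```

Strong LP duality makes the two printed values coincide; this is the finite-dimensional counterpart of the Kantorovich duality used above.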
B Linear-Quadratic Problems

Proof of Theorem 4.
Let $v' : \mathbb R^n \to \mathbb R$ be defined as $v'(x) := x^\top P x + z$. To compute $T'_\lambda v'$, we first calculate the inner maximization part in Proposition 6 as follows:
$$\phi(u, w) := \sup_{w' \in \mathbb R^l} \big[ \alpha v'(f(x, u, w')) - \lambda d(w, w')^p \big] = \sup_{w' \in \mathbb R^l} \big[ \alpha (Ax + Bu + \Xi w')^\top P (Ax + Bu + \Xi w') + \alpha z - \lambda \|w - w'\|^2 \big].$$
There exists a constant $\bar\lambda > 0$, depending on $P$, such that for any $\lambda \ge \bar\lambda$ the objective function of the maximization problem above is strictly concave in $w'$ (i.e., $\lambda I - \alpha \Xi^\top P \Xi$ is positive definite), and thus the unique maximizer is given by
$$w^\star := (\lambda I - \alpha \Xi^\top P \Xi)^{-1} \big[ \alpha \Xi^\top P (Ax + Bu) + \lambda w \big]. \tag{B.1}$$
With this maximizer, we can rewrite the term $\phi(u, w)$ as
$$\phi(u, w) = \alpha \big[ x^\top A^\top P A x + u^\top B^\top P B u + 2 x^\top A^\top P B u + z \big] + \big[ \alpha \Xi^\top P (Ax + Bu) + \lambda w \big]^\top (\lambda I - \alpha \Xi^\top P \Xi)^{-1} \big[ \alpha \Xi^\top P (Ax + Bu) + \lambda w \big] - \lambda \|w\|^2.$$
Since $\mathbb E_{w \sim \nu_N}[w] = 0$ and $\mathbb E_{w \sim \nu_N}[w w^\top] = \Sigma$, we have
$$\begin{aligned} \mathbb E_{w \sim \nu_N}[\phi(u, w)] ={}& u^\top \big[ \alpha B^\top P B + \alpha^2 B^\top P \Xi (\lambda I - \alpha \Xi^\top P \Xi)^{-1} \Xi^\top P B \big] u \\ &+ 2\alpha \big[ x^\top A^\top + \alpha x^\top A^\top P \Xi (\lambda I - \alpha \Xi^\top P \Xi)^{-1} \Xi^\top \big] P B u \\ &+ x^\top \big[ \alpha A^\top P A + \alpha^2 A^\top P \Xi (\lambda I - \alpha \Xi^\top P \Xi)^{-1} \Xi^\top P A \big] x \\ &+ \alpha z + \lambda^2 \operatorname{tr}\big[ (\lambda I - \alpha \Xi^\top P \Xi)^{-1} \Sigma \big] - \lambda \operatorname{tr}[\Sigma]. \end{aligned}$$
Recall that
$$(T'_\lambda v')(x) = \inf_{u \in \mathcal U(x)} \big[ c(x, u) + \mathbb E_{w \sim \nu_N}[\phi(u, w)] \big]. \tag{B.2}$$
We notice that $R + \alpha B^\top P B + \alpha^2 B^\top P \Xi (\lambda I - \alpha \Xi^\top P \Xi)^{-1} \Xi^\top P B$ is positive definite for $\lambda \ge \bar\lambda$, because $R$ is positive definite and $\lambda I - \alpha \Xi^\top P \Xi$ is positive definite for $\lambda \ge \bar\lambda$. Thus, the objective function in (B.2) is strictly convex in $u$ and has the unique minimizer $u^\star = Kx$. Therefore, we obtain
$$(T'_\lambda v')(x) = x^\top \big( Q + \alpha A^\top P A + \alpha^2 A^\top S A \big) x + \alpha z + \lambda \operatorname{tr}\big[ \{ \lambda (\lambda I - \alpha \Xi^\top P \Xi)^{-1} - I \} \Sigma \big].$$
We conclude that $v'$ solves the Bellman equation, since $P$ and $z$ satisfy $P = Q + \alpha A^\top P A + \alpha^2 A^\top S A$ and $(1 - \alpha) z = \lambda \operatorname{tr}[\{\lambda (\lambda I - \alpha \Xi^\top P \Xi)^{-1} - I\} \Sigma]$. Furthermore, when $v'$ is the optimal value function, the action of an optimal policy $\pi'$ at $x \in \mathbb R^n$ is uniquely given by $u^\star$, i.e., $\pi'(x) = Kx$.
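Numerically, $P$ and $K$ can be obtained by iterating the Bellman recursion derived above until convergence. The sketch below assembles the gain from the minimization of (B.2) by completing the square in $u$, using the expectation of $\phi(u, w)$ computed in the proof; the intermediate matrix Pt (a shorthand introduced here, not the paper's notation) and the resulting formulas should be checked against the statement of Theorem 4, and the system data are placeholders.

```python
import numpy as np

def dr_lq_riccati(A, B, Xi, Q, R, alpha, lam, tol=1e-10, max_iter=10_000):
    """Value-iteration-style fixed point for v'(x) = x^T P x + z in the
    Wasserstein-penalty LQ problem; returns P and the gain K with u* = K x."""
    n, l = A.shape[0], Xi.shape[1]
    P = np.zeros((n, n))
    K = np.zeros((B.shape[1], n))
    for _ in range(max_iter):
        Sbar = lam * np.eye(l) - alpha * Xi.T @ P @ Xi
        assert np.all(np.linalg.eigvalsh(Sbar) > 0), "need lam >= lam_bar"
        Pt = P + alpha * P @ Xi @ np.linalg.solve(Sbar, Xi.T @ P)
        G = R + alpha * B.T @ Pt @ B                 # strictly convex in u
        K = -alpha * np.linalg.solve(G, B.T @ Pt @ A)
        P_new = Q + alpha * A.T @ Pt @ (A + B @ K)   # Riccati-like update
        if np.max(np.abs(P_new - P)) < tol:
            P = P_new
            break
        P = P_new
    # z follows from (1 - alpha) z = lam * tr[(lam * inv(Sbar) - I) @ Sigma].
    return P, K

# Illustrative use with placeholder data:
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Xi, Q, R = B.copy(), np.eye(2), np.eye(1)
P, K = dr_lq_riccati(A, B, Xi, Q, R, alpha=0.95, lam=5.0)
```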
We now characterize the worst-case distribution policy. Plugging $w = \hat w^{(i)}$ and $u = Kx$ into (B.1), we obtain
$$w'^{(i)}_x = (\lambda I - \alpha \Xi^\top P \Xi)^{-1} \big[ \alpha \Xi^\top P (Ax + BKx) + \lambda \hat w^{(i)} \big].$$
Let $\gamma'(x) := \frac1N \sum_{i=1}^N \delta_{w'^{(i)}_x}$ for all $x \in \mathcal X$. Then,
$$W(\gamma'(x), \nu_N) = \min \Big\{ \sum_{i,j=1}^N \kappa_{ij} \big\| w'^{(i)}_x - \hat w^{(j)} \big\|^2 \;\Big|\; \kappa_{ij} \ge 0, \;\; \sum_{j=1}^N \kappa_{ij} = \tfrac1N, \; i = 1, \dots, N, \;\; \sum_{i=1}^N \kappa_{ij} = \tfrac1N, \; j = 1, \dots, N \Big\} \le \frac1N \sum_{i=1}^N \big\| w'^{(i)}_x - \hat w^{(i)} \big\|^2.$$
Therefore, we have
$$c(x, u^\star) - \lambda W(\gamma'(x), \nu_N) + \alpha \, \mathbb E_{w \sim \gamma'(x)} \big[ v'(f(x, u^\star, w)) \big] \ge c(x, u^\star) - \frac{\lambda}{N} \sum_{i=1}^N \big\| w'^{(i)}_x - \hat w^{(i)} \big\|^2 + \frac{\alpha}{N} \sum_{i=1}^N v'(f(x, u^\star, w'^{(i)}_x)) = c(x, u^\star) + \frac1N \sum_{i=1}^N \sup_{w \in \mathbb R^l} \big[ \alpha v'(f(x, u^\star, w)) - \lambda \|w - \hat w^{(i)}\|^2 \big],$$
where the inequality follows from the upper bound on $W(\gamma'(x), \nu_N)$ above, and the equality holds because each $w'^{(i)}_x$ attains the corresponding supremum by (B.1). On the other hand, it follows from Proposition 6 that
$$\sup_{\mu \in \mathcal P(\mathbb R^l)} \Big\{ c(x, u^\star) - \lambda W(\mu, \nu_N) + \alpha \, \mathbb E_{w \sim \mu} \big[ v'(f(x, u^\star, w)) \big] \Big\} = c(x, u^\star) + \frac1N \sum_{i=1}^N \sup_{w \in \mathbb R^l} \big[ \alpha v'(f(x, u^\star, w)) - \lambda \|w - \hat w^{(i)}\|^2 \big].$$
Thus, we conclude that $\gamma'(x)$ is one of the worst-case distributions.

We now consider the case in which the data samples $\hat w^{(i)}$ have nonzero mean, i.e.,
$$\bar w := \mathbb E_{w \sim \nu_N}[w] = \frac1N \sum_{i=1}^N \hat w^{(i)} \neq 0.$$
The linear system (5.1) can be rewritten as
$$x_{t+1} = A x_t + B u_t + \Xi w'_t + \Xi \bar w,$$
where $w'_t := w_t - \bar w$. We now normalize the data samples as $\hat w'^{(i)} := \hat w^{(i)} - \bar w$ for all $i \in \mathcal I$ so that $\frac1N \sum_{i=1}^N \hat w'^{(i)} = 0$. Let $\bar x := (I - A)^{-1} \Xi \bar w$, assuming it is well defined. Then,
$$\begin{bmatrix} x_{t+1} - \bar x \\ 1 \end{bmatrix} = \begin{bmatrix} A & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_t - \bar x \\ 1 \end{bmatrix} + \begin{bmatrix} B \\ 0 \end{bmatrix} u_t + \begin{bmatrix} \Xi \\ 0 \end{bmatrix} w'_t.$$
By letting $x'_t := ((x_t - \bar x)^\top, 1)^\top \in \mathbb R^{n+1}$, we can rewrite the system as $x'_{t+1} = A' x'_t + B' u_t + \Xi' w'_t$. Define a positive semidefinite matrix $Q' \in \mathbb R^{(n+1) \times (n+1)}$ by
$$Q' := \begin{bmatrix} I & \bar x \end{bmatrix}^\top Q \begin{bmatrix} I & \bar x \end{bmatrix} = \begin{bmatrix} Q & Q \bar x \\ \bar x^\top Q & \bar x^\top Q \bar x \end{bmatrix}.$$
We then have $x_t^\top Q x_t = x'^\top_t Q' x'_t$. Thus, the nonzero-mean case is converted to the zero-mean case with the normalized data $\hat w'^{(i)}$, the expanded state $x'_t$, and the new positive semidefinite matrix $Q'$ in the quadratic cost function. Therefore, we can use Theorem 4 to compute the DR-control gain matrix $K'$. The corresponding optimal policy is obtained as $\pi'(x) := K'((x - \bar x)^\top, 1)^\top$ for all $x \in \mathbb R^n$.
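The reduction above is mechanical state augmentation, sketched below; dimensions and data are placeholders. The returned matrices can be fed to the Riccati sketch above to obtain a gain Kp, and the policy is then evaluated on the augmented state.

```python
import numpy as np

def augment_for_nonzero_mean(A, B, Xi, Q, w_hat):
    """Shift samples to zero mean and expand the state to ((x - x_bar)^T, 1)^T,
    reducing the nonzero-mean case to the zero-mean setting of Theorem 4."""
    n, m, l = A.shape[0], B.shape[1], Xi.shape[1]
    w_bar = w_hat.mean(axis=0)                      # empirical mean of samples
    w_hat0 = w_hat - w_bar                          # normalized (zero-mean) data
    # x_bar = (I - A)^{-1} Xi w_bar, assuming I - A is invertible.
    x_bar = np.linalg.solve(np.eye(n) - A, Xi @ w_bar)
    Ap = np.block([[A, np.zeros((n, 1))],
                   [np.zeros((1, n)), np.ones((1, 1))]])
    Bp = np.vstack([B, np.zeros((1, m))])
    Xip = np.vstack([Xi, np.zeros((1, l))])
    T = np.hstack([np.eye(n), x_bar.reshape(n, 1)]) # T = [I, x_bar]
    Qp = T.T @ Q @ T                                # Q' = T^T Q T
    return Ap, Bp, Xip, Qp, x_bar, w_hat0

# With a gain Kp computed for (Ap, Bp, Xip, Qp), the policy is evaluated as
#   u = Kp @ np.append(x - x_bar, 1.0)
```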
References

[1] A. Nilim and L. El Ghaoui, "Robust control of Markov decision processes with uncertain transition matrices," Oper. Res., vol. 53, no. 5, pp. 780–798, 2005.
[2] S. Samuelson and I. Yang, "Data-driven distributionally robust control of energy storage to manage wind power fluctuations," in Proceedings of the 1st IEEE Conference on Control Technology and Applications, 2017.
[3] P. Mohajerin Esfahani and D. Kuhn, "Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations," Math. Program., vol. 171, no. 1–2, pp. 115–166, 2018.
[4] C. Zhao and Y. Guan, "Data-driven risk-averse stochastic optimization with Wasserstein metric," Oper. Res. Lett., vol. 46, no. 2, 2018.
[5] R. Gao and A. J. Kleywegt, "Distributionally robust stochastic optimization with Wasserstein distance," arXiv:1604.02199, 2016.
[6] J. Blanchet, K. Murthy, and F. Zhang, "Optimal transport based distributionally robust optimization: Structural properties and iterative schemes," arXiv:1810.02403, 2018.
[7] A. Sinha, H. Namkoong, and J. Duchi, "Certifying some distributional robustness with principled adversarial training," in International Conference on Learning Representations, 2018.
[8] R. Chen and I. C. Paschalidis, "A robust learning approach for regression models based on distributionally robust optimization," J. Mach. Learn. Res., pp. 1–48, 2018.
[9] S. Shafieezadeh-Abadeh, V. A. Nguyen, D. Kuhn, and P. Mohajerin Esfahani, "Wasserstein distributionally robust Kalman filtering," in Neural Information Processing Systems, 2018.
[10] I. Popescu, "Robust mean-covariance solutions for stochastic optimization," Oper. Res., vol. 55, no. 1, pp. 98–112, 2007.
[11] E. Delage and Y. Ye, "Distributionally robust optimization under moment uncertainty with application to data-driven problems," Oper. Res., vol. 58, no. 3, pp. 595–612, 2010.
[12] S. Zymler, D. Kuhn, and B. Rustem, "Distributionally robust joint chance constraints with second-order moment information," Math. Program., Ser. A, vol. 137, pp. 167–198, 2013.
[13] W. Wiesemann, D. Kuhn, and M. Sim, "Distributionally robust convex optimization," Oper. Res., vol. 62, no. 6, pp. 1358–1376, 2014.
[14] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen, "Robust solutions of optimization problems affected by uncertain probabilities," Manage. Sci., vol. 59, no. 2, pp. 341–357, 2013.
[15] R. Jiang and Y. Guan, "Data-driven chance constrained stochastic program," Math. Program., Ser. A, vol. 158, pp. 291–327, 2016.
[16] H. Xu and S. Mannor, "Distributionally robust Markov decision processes," Math. Oper. Res., vol. 37, no. 2, pp. 288–300, 2012.
[17] P. Yu and H. Xu, "Distributionally robust counterpart in Markov decision processes," IEEE Trans. Autom. Control, vol. 61, no. 9, pp. 2538–2543, 2016.
[18] I. Yang, "A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance," IEEE Control Syst. Lett., vol. 1, no. 1, pp. 164–169, 2017.
[19] J. I. González-Trejo, O. Hernández-Lerma, and L. F. Hoyos-Reyes, "Minimax control of discrete-time stochastic systems," SIAM J. Control Optim., vol. 41, no. 5, pp. 1626–1659, 2003.
[20] B. P. G. Van Parys, D. Kuhn, P. J. Goulart, and M. Morari, "Distributionally robust control of constrained stochastic systems," IEEE Trans. Autom. Control, vol. 61, no. 2, pp. 430–442, 2016.
[21] I. Yang, "A dynamic game approach to distributionally robust safety specifications for stochastic systems," Automatica, vol. 94, pp. 94–101, 2018.
[22] I. Tzortzis, C. D. Charalambous, and T. Charalambous, "Infinite horizon average cost dynamic programming subject to total variation distance ambiguity," SIAM J. Control Optim., vol. 57, no. 4, pp. 2843–2872, 2019.
[23] I. Tzortzis, C. D. Charalambous, T. Charalambous, C. K. Kourtellaris, and C. N. Hadjicostis, "Robust linear quadratic regulator for uncertain systems," in Proc. 55th IEEE Conf. Decis. Control, 2016.
[24] I. R. Petersen, M. R. James, and P. Dupuis, "Minimax optimal control of stochastic uncertain systems with relative entropy constraints," IEEE Trans. Autom. Control, vol. 45, no. 3, pp. 398–412, 2000.
[25] I. Yang, "Distributionally robust stochastic control with conic confidence sets," in Proc. 56th IEEE Conf. Decis. Control, 2017.
[26] H. U. Küenle, "Stochastic games with complete information and average cost criteria," in Advances in Dynamic Games and Applications. Birkhäuser, 2000, pp. 325–338.
[27] R. Reemtsen, "Discretization methods for the solution of semi-infinite programming problems," J. Optim. Theory Appl., vol. 71, no. 1, pp. 85–103, 1991.
[28] R. Hettich and K. O. Kortanek, "Semi-infinite programming: Theory, methods, and applications," SIAM Rev., vol. 35, no. 3, pp. 380–429, 1993.
[29] M. López and G. Still, "Semi-infinite programming," Eur. J. Oper. Res., vol. 180, pp. 491–518, 2007.
[30] G. Calafiore and M. C. Campi, "Uncertain convex programs: randomized solutions and confidence levels," Math. Program., Ser. A, vol. 102, pp. 25–46, 2005.
[31] N. Saldi, T. Linder, and S. Yüksel, Finite Approximations in Discrete-Time Stochastic Control: Quantized Models and Asymptotic Optimality. Birkhäuser, 2018.
[32] D. P. Bertsekas, Dynamic Programming and Optimal Control, 4th ed. Athena Scientific, 2012, vol. 2.
[33] M. L. Puterman and M. C. Shin, "Modified policy iteration algorithms for discounted Markov decision problems," Manage. Sci., vol. 24, no. 11, pp. 1127–1137, 1978.
[34] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
[35] J. E. Smith and R. L. Winkler, "The optimizer's curse: Skepticism and postdecision surprise in decision analysis," Manage. Sci., vol. 52, no. 3, pp. 311–322, 2006.
[36] N. Fournier and A. Guillin, "On the rate of convergence in Wasserstein distance of the empirical measure," Probab. Theory Relat. Fields, vol. 162, no. 3–4, pp. 707–738, 2015.
[37] K. Kim and I. Yang, "Minimax control of ambiguous linear stochastic systems using the Wasserstein metric," in Proc. 59th IEEE Conf. Decis. Control, 2020.
[38] F. L. Lewis, D. Vrabie, and V. L. Syrmos, Optimal Control. John Wiley & Sons, 2012.
[39] P. A. Samuelson, "Lifetime portfolio selection by dynamic stochastic programming," Rev. Econ. Stat., vol. 51, no. 3, pp. 239–246, 1969.
[40] N. H. Hakansson, "Optimal investment and consumption strategies under risk for a class of utility functions," Econometrica, vol. 38, no. 5, pp. 587–607, 1970.
[41] I. Yang, "A convex optimization approach to dynamic programming in continuous state and action spaces," J. Optim. Theory Appl., vol. 187, pp. 133–157, 2020.
[42] A. R. Bergen and V. Vittal, Power Systems Analysis. Pearson, 1999.
[43] F. Dörfler and F. Bullo, "Novel insights into lossless AC and DC power flow," in Proceedings of the 2013 Power and Energy Society General Meeting, 2013.
[44] G. Fazelnia, R. Madani, A. Kalbat, and J. Lavaei, "Convex relaxation for optimal distributed control problems," IEEE Trans. Autom. Control, vol. 62, no. 1, pp. 206–221, 2017.
[45] F. Dörfler, M. R. Jovanović, M. Chertkov, and F. Bullo, "Sparsity-promoting optimal wide-area control of power networks," IEEE Trans. Power Syst., vol. 29, no. 5, pp. 2281–2291, 2014.
[46] R. D. Zimmerman, C. E. Murillo-Sánchez, and R. J. Thomas, "MATPOWER: Steady-state operations, planning, and analysis tools for power systems research and education," IEEE Trans. Power Syst., vol. 26, no. 1, pp. 12–19, 2011.